Overview


This is a progress report for the first year of a three-year grant. It is organized as follows:
The overall nature and scope of the work is first described. We then briefly summarize
progress made in several areas. Each of these areas is then explained in more detail in the
papers, theses, and internal reports that follow this brief overview. These latter documents are
self-sufficient in that they contain the necessary background, introductions and references.

This grant involves continuing work on a neural-net methodology for high-level vision.
We use an optimization approach and optimizing neural networks to perform the sorts
of combinatorial searches that arise inevitably in high-level vision. The premise is that
high-level vision is in essence a matching process in which complex models are matched to
complex objects. By complex, we mean objects that are composed of many parts and have
many degrees of freedom. We are working on several [text missing in source] for these
nonconvex problems, (3) trying this out on real objects, and (4) a firmer theoretical
framework based on Bayesian principles. Below we describe progress in three areas:

• A learning algorithm was proposed and implemented for our graph-matching networks.
Previously, models of objects were stored in the network by a simple hand-design
method. The new method allows a supervised learning methodology for
storing the models. The actual networks use a combination of backprop and clustering
to approximate curves that tend to attain significant values in localized regions
(i.e., learning bumps). The approach and experiments are described in a completed
thesis by Grant Shumaker, a copy of which is enclosed. This thesis also describes
numerical simulations in object recognition with an earlier proposed "Lagrange
multiplier" network for constrained optimization.

• To date, our networks have been applied to artificial object domains and our focus
has been on theory and methodology. In a paper by Volker Tresp (to appear
in NIPS-3), results on the recognition of real objects (machined parts) are reported.
Also significant is the use of a continuation method (two types of mean-field
annealing) that leads to much improved convergence and ability to escape local
minima. In one experiment, Tresp improves on a method proposed by Peterson and
Soderberg (IJNS vol. 1, 1989) by incorporating more complex constraints into the
optimization method.

• Our networks perform a sort of weighted graph matching, but their design is heuristic
in several respects. In an internal report, Utans and Gindi describe a more pleasing
approach in which weighted graph matching arises as a form of Bayesian inference.

1 Weighted Graph Matching

Consider the following problem of pure graph matching (see Figure 1). Given two graphs each
consisting of $N$ nodes, let $\alpha$ ($1 \le \alpha \le N$) index the nodes of one of the graphs, and $i$ ($1 \le i \le N$)
index the nodes of the other. Let $G_{\alpha\beta}$ be the adjacency matrix of one of the graphs:

$$G_{\alpha\beta} = \begin{cases} 1 & \text{if } \alpha \text{ and } \beta \text{ are connected by an arc} \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

Likewise $g_{ij}$ is the adjacency matrix of the other graph. A sparse match matrix $\{M_{\alpha i}\}$, $M_{\alpha i} \in (0, 1)$,
of dynamic variables represents the correspondence between nodes $\alpha$ and $i$.

Graph matching then means finding a match matrix $M$ that maps nodes in $G_{\alpha\beta}$ to nodes in
$g_{ij}$ in a structurally consistent way, i.e., finding consistent rectangles (see Figure 1). In a weighted
graph matching problem, arcs have numerical weights $W_{\alpha\beta}$ and $w_{ij}$ attached; the optimal
mapping then also minimizes the difference of the weights of arcs connecting matched nodes.
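
As a concrete illustration of the weighted matching objective, the following minimal sketch scores a candidate node assignment by the summed squared difference between model and data arc weights and finds the best assignment by exhaustive search. The exhaustive search, the squared-difference cost, and the names `match_cost` and `best_match` are assumptions made for this example only; the networks in this report search the same kind of space with continuous dynamics rather than by enumeration.

```python
from itertools import permutations

def match_cost(W, w, perm):
    """Sum of squared arc-weight differences for one assignment.

    W is the model weight matrix, w the data weight matrix, and
    perm[i] is the model node assigned to data node i.
    """
    n = len(w)
    return sum((W[perm[i]][perm[j]] - w[i][j]) ** 2
               for i in range(n) for j in range(n))

def best_match(W, w):
    """Exhaustively search all permutations (feasible only for tiny graphs)."""
    return min(permutations(range(len(w))), key=lambda p: match_cost(W, w, p))

# Model graph: a path 0-1-2 with arc weights 2 and 5 (0 means "no arc").
W = [[0, 2, 0],
     [2, 0, 5],
     [0, 5, 0]]

# Data graph: the same graph with its nodes relabelled.
true_assignment = (2, 0, 1)
w = [[W[true_assignment[i]][true_assignment[j]] for j in range(3)]
     for i in range(3)]

best = best_match(W, w)
```

The number of permutations grows factorially with the number of nodes, which is why an optimization network is attractive for anything beyond toy graphs.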

2  Graph  Matching  as  Bayesian  Inference 

The problem we are addressing here involves finding a match matrix $M$ given the data graph $g_{ij}$
and the model information (the model graph $G_{\alpha\beta}$). In a Bayesian sense we want to find the MAP
estimate for the best $M$ given the data, as expressed by Bayes' theorem:

$$P(M \mid g) = \frac{P(g \mid M)\, P(M)}{P(g)} \qquad (2)$$

Figure 1: Graph Matching: The circles represent graph nodes, the solid lines graph arcs and the
dotted lines matches between nodes in two different graphs. The nodes 1, 2, 3 and 4 form a consistent
rectangle.


Here, $P(M)$ is the prior regarding a valid match matrix $M$ and

$$P(g) = \sum_{M} P(g \mid M)\, P(M) \qquad (3)$$

is a normalizing constant. Thus, we are left with finding an expression for $P(g \mid M)$. This can be
interpreted as finding the "forward" model (or grammar) that describes the generation of typical
data graphs assuming that we are given a valid match matrix $M$.

We proceed by assuming that arcs in $g$ are generated independently and thus

$$P(g \mid M) = \prod_{ij} P(g_{ij} \mid M) \qquad (4)$$

This might not be a realistic assumption, for if $g$ describes an object it is more reasonable to assume
that there exists some correlation; for the moment we make this assumption for simplicity.

An arc $g_{ij}$ will be present in the data graph if both nodes $i$ and $j$ get mapped to nodes $\alpha$ and
$\beta$ in the model graph. This means that elements $M_{\alpha i}$ and $M_{\beta j}$ of $M$ are both 1. In addition, nodes
$\alpha$ and $\beta$ must be connected by an arc $G_{\alpha\beta}$. Thus

$$P(g_{ij} = 1 \mid M) = M_{\alpha i}\, M_{\beta j}\, G_{\alpha\beta} \qquad (5)$$

where $M_{\alpha i}$ and $M_{\beta j}$ are used as "switch" terms. Since $g_{ij}$ can assume only two values,

$$P(g_{ij} = 0 \mid M) = 1 - P(g_{ij} = 1 \mid M) \qquad (6)$$

$G_{\alpha\beta}$ in the model graph can be given a probabilistic meaning: it is the probability that in a typical
instance $g$ of $G$ a particular arc is present.
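
The forward model of Equations 4 to 6 can be evaluated numerically for a candidate match matrix. The sketch below is illustrative only: it assumes that the switch terms of Equation 5 are summed over all model node pairs (for a permutation matrix this selects the single matched pair), it skips the diagonal, and it clips probabilities away from 0 and 1 so the logarithm stays finite. The function names and the example matrices are hypothetical.

```python
import math

def arc_probability(M, G, i, j):
    """P(g_ij = 1 | M), with M[a][i] and M[b][j] acting as switches on G[a][b]."""
    n = len(G)
    return sum(M[a][i] * M[b][j] * G[a][b]
               for a in range(n) for b in range(n))

def log_likelihood(g, M, G, eps=1e-12):
    """log P(g | M) under the independent-arc assumption (Equation 4)."""
    n = len(g)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-arcs in this toy example
            p = min(max(arc_probability(M, G, i, j), eps), 1.0 - eps)
            total += math.log(p if g[i][j] == 1 else 1.0 - p)
    return total

# Model: arcs 0-1 and 1-2 are likely (0.9), arc 0-2 is unlikely (0.1).
G = [[0.0, 0.9, 0.1],
     [0.9, 0.0, 0.9],
     [0.1, 0.9, 0.0]]
# Observed data graph: a path 0-1-2.
g = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
swapped = [[1, 0, 0], [0, 0, 1], [0, 1, 0]]   # data nodes 1 and 2 exchanged

ll_identity = log_likelihood(g, identity, G)
ll_swapped = log_likelihood(g, swapped, G)
```

The structurally consistent assignment receives the higher likelihood, which is the property the MAP estimate exploits.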

For the match between model and data to be unique, $M$ must be a permutation matrix: $M_{ij} \in
[0, 1]$ and $\sum_i M_{ij} = 1$, $\sum_j M_{ij} = 1$. If $M$ is chosen from the domain of permutation matrices,
$P(M)$ would just be a constant (assuming a uniform distribution). Here, we want to implement the
elements of $M$ as independent match neurons and therefore have to specify the prior distribution
in such a way that valid permutation matrices are more likely than others:

$$P(M) \propto \exp\!\Big(-\frac{A}{2} \sum_j \big(\sum_i M_{ij} - 1\big)^2\Big) \qquad (7)$$

which is a Gaussian approximation to the ideal distribution

$$P(M) = \prod_j \delta\big(\sum_i M_{ij},\, 1\big) \qquad (8)$$


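Now, the final distribution is given by

$$P(M \mid g) \propto \prod_{ij} P(g_{ij} \mid M)\; \exp\!\Big(-\frac{A}{2} \sum_j \big(\sum_i M_{ij} - 1\big)^2\Big) \qquad (9)$$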


3  Extension  to  Weighted  Graph  Matching 

Here, the arcs in the data graph $g_{ij}$ have weights $w_{ij}$ attached; in the "forward" sense, we therefore
want to compute

$$P(w_{ij} \mid M) = \sum_{\{g_{ij}\}} P(w_{ij} \mid g_{ij}, M)\, P(g_{ij} \mid M) \qquad (10)$$

where the summation is taken over all values $g_{ij}$ can assume, i.e. 0 and 1. Next we define


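$$P(w_{ij} \mid g_{ij}, M) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma_w}\, e^{-(w_{ij} - M_{\alpha i} M_{\beta j} W_{\alpha\beta})^2 / 2\sigma_w^2} & \text{if } g_{ij} = 1 \\ 0 & \text{if } g_{ij} = 0 \end{cases} \qquad (11)$$

or, using $g_{ij}$ as a "switch" term,

$$P(w_{ij} \mid g_{ij}, M) = g_{ij}\, \frac{1}{\sqrt{2\pi}\,\sigma_w}\, e^{-(w_{ij} - M_{\alpha i} M_{\beta j} W_{\alpha\beta})^2 / 2\sigma_w^2} \qquad (12)$$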


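This expression states that if there is no arc $g_{ij}$ in the data graph then there can be no weight
$w_{ij}$; otherwise, the weight is determined by the model (Gaussian distributed with mean $W_{\alpha\beta}$ and
variance $\sigma_w^2$). Thus,

$$P(w_{ij} \mid M) = g_{ij}\, \frac{1}{\sqrt{2\pi}\,\sigma_w}\, e^{-(w_{ij} - M_{\alpha i} M_{\beta j} W_{\alpha\beta})^2 / 2\sigma_w^2}\, P(g_{ij} = 1 \mid M) \qquad (13)$$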
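
which allows us to use the results from the previous section. For weighted graph matching we thus
obtain

$$P(M \mid g, w) \propto P(M) \prod_{ij} g_{ij}\, \frac{1}{\sqrt{2\pi}\,\sigma_w}\, e^{-(w_{ij} - M_{\alpha i} M_{\beta j} W_{\alpha\beta})^2 / 2\sigma_w^2}\, P(g_{ij} = 1 \mid M) \qquad (14)$$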

Abstract

This poster presents a model-based neural vision system. Scenes are described
in terms of shape primitives and their relational structure. The neural network
matches the primitives in the scene and the primitives in a model base by finding
the best agreement between primitives and their relational structure under the
constraint that at most one primitive in the model base should be assigned to a
primitive in the scene. The quality of the solutions and the convergence speed
were both improved by using mean field approximations. The approach was tested
in 2-D and in 3-D object recognition. The primitives are line segments derived
from edges in the scenes. In the 2-D problem, the recognition is independent of
position, orientation, size and slight perspective distortions of the objects. In the
3-D problem, stereo images are used to generate a 3-D description of the visible (not
occluded) portions of the objects in the scenes. The scene description is matched
against objects in a model base. The best match identifies the correct models and,
using the information obtained in the preprocessing step, the 3-D positions of the
objects can be determined.


1  Introduction 

Many machine vision systems and, to a large extent, also the human visual system are
model based. The scene is described in terms of shape primitives and their relational
structure, and the vision system tries to find a match between the scene description and
"familiar" objects in a model base. If the scenes are acquired with a single camera and the
objects are flat or always viewed from the same perspective, the problem is essentially
2-D. It is often required that the recognition be invariant to rotation, translation, scale
and slight perspective distortions. This can be achieved if only parameters invariant
to those transformations are used in the recognition system. In this paper, a system
is described in which shape primitives are line segments derived from the edges in the
image. The scene is described in terms of the relations of line segments, e.g. the angles
between line segments and the logarithms of the ratios of their lengths. In the approach
presented here, a neuron is assigned to every possible match between primitives in the
scene and the model base. The network is designed to find the best match between the
scene description and the model base under the constraint that at most one primitive
in the model base is assigned to a primitive in the scene description.

Figure 1: Match of primitive $p_\alpha$ to $p_i$ (panels: model, scene).

If  the  problem  is  intrinsically  3-D  as  in  many  robotics  applications,  the  vision  system 
should  capture  the  true  3-D  structure  of  the  scene.  Using  the  sensory  information 
available,  a  3-D  description  of  the  scene  must  be  generated  which  can  then  be  compared 
to  3-D  descriptions  of  models  in  the  model  base.  In  this  approach,  this  information  is 
gained  from  two  stereo-images  and  the  correspondence  problem  must  be  solved  first. 
Only  part  of  an  object  is  visible  at  any  one  time  which  in  general  will  only  permit 
a  3-D  description  of  that  portion  of  the  object.  In  this  poster,  we  describe  a  neural 
network  approach  that  offers  an  elegant  method  to  handle  the  uncertainty  in  the  3-D 
scene  description  and  solves  both  the  correspondence  problem  and  the  model  matching 
task. 

2  The  Network  Architecture 

The activity of a match neuron $m_{\alpha i}$ represents the certainty of a match between a
primitive $p_\alpha$ in the model base and $p_i$ in the scene description. The connectivity
of the network is most easily described by the network's energy function, where the
fixed points of the network correspond to the minima of the energy function. The
energy function in the system described here is the sum of several terms. The first
term evaluates the match between the primitives:

$$E_P = -\sum_{\alpha i} \kappa_{\alpha i}\, m_{\alpha i}. \qquad (1)$$

The function $\kappa_{\alpha i}$ is zero if the type of primitive $p_\alpha$ is not equal to the type of primitive $p_i$.
If both types are identical, $\kappa_{\alpha i}$ evaluates the agreement between parameters $p_\alpha(k)$ and
$p_i(k)$ which describe properties of the primitives. Here, $\kappa_{\alpha i} = \rho\big(\sum_k |p_\alpha(k) - p_i(k)| / \sigma_k\big)$
is maximum if the parameters of $p_\alpha$ and $p_i$ match (Figures 1, 3).

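A direct comparison of the primitives is not sufficient. The evaluation of the match
between the relations of primitives in scene and data base is performed by the energy
term [3]

$$E_S = -\frac{1}{2} \sum_{\alpha\beta, ij} \chi_{\alpha,\beta,i,j}\, m_{\alpha i}\, m_{\beta j}. \qquad (2)$$

The function $\chi_{\alpha,\beta,i,j} = \rho(\ldots)$ [argument illegible in source] is maximum if the relation
between $p_\alpha$ and $p_\beta$ matches the relation between $p_i$ and $p_j$.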

The primitives can be interpreted as nodes in a graph and the relations between
the primitives as labeled arcs. Seen in this way, the network solves a graph matching
problem [1, 7, 6, 8].

Depending on the application, uniqueness constraints may have to be satisfied.
These can be incorporated as additional (penalty) energy terms. For example, the
constraint that a primitive in the scene should only match to one or no primitive in the
model base (column constraint) can be implemented by [6]

$$E_C = \sum_i \Big[\Big(\big(\sum_\alpha m_{\alpha i}\big) - 1\Big)^2 \sum_\alpha m_{\alpha i}\Big]. \qquad (3)$$

$E_C$ is equal to zero only if in all columns the sum over the activations of all neurons is
equal to one or zero, and positive otherwise.

If neurons are employed that can take on continuous values ($m_{\alpha i} \in (0, 1)$), an
additional term is helpful that encourages neurons to assume values close to zero or
one:

$$E_B = \sum_{\alpha i} m_{\alpha i}\,(1 - m_{\alpha i}). \qquad (4)$$

2.1 Dynamic Equations and Mean Field Theory

2.1.1 MFA1

The neural network should make binary decisions, match or no match, but binary
recurrent networks easily get stuck in local minima. A higher probability of reaching
a lower local minimum can be obtained by using the mean field approximation of
statistical physics. Here, the network is interpreted as a system of interacting units. If
such a system is in thermal contact with a heat reservoir of temperature $T$, the system
will be in a state $S$ with the probability

$$P(S) = \frac{e^{-E(S)/T}}{Z} \qquad (5)$$

where

$$Z = \sum_S e^{-E(S)/T} \qquad (6)$$

is the partition function of the system and $E(S)$ is the energy of the system in state
$S$.

Such a system minimizes the free energy

$$F = E - TS \qquad (7)$$

where $S$ is the entropy of the system. At $T = 0$ the energy $E$ is minimized. Equation 5
indicates that states with low energy are more likely, but since there are more states
accessible with a higher energy, the system is with a high probability close to the mean
energy $\sum_S E(S)\, e^{-E(S)/T} / Z$.

Bad local minima can be avoided by using an annealing strategy, but annealing is
time consuming when simulated on a digital computer. Using a mean field approximation
one can obtain deterministic equations while retaining some of the advantages of the
annealing process [5, 4, 2].

The mean field theory gives a good approximation for a highly interconnected network
with many neurons. The theory works under the assumption that the change in
energy of the system if one neuron changes its state is approximately equal to the change
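in energy of that neuron in the field at its location. The mean value $v_{\alpha i} = \langle m_{\alpha i} \rangle$ of
a neuron becomes

$$v_{\alpha i} = \frac{1}{Z}\Big[\, 1 \times \sum_{S,\, m_{\alpha i}=1} e^{-E(S)/T} + 0 \times \sum_{S,\, m_{\alpha i}=0} e^{-E(S)/T} \Big] \qquad (8)$$

$$\approx \frac{1 \times e^{u_{\alpha i}/T}}{e^{u_{\alpha i}/T} + e^{-0/T}} = \frac{1}{1 + e^{-u_{\alpha i}/T}} \qquad (9)$$

with

$$u_{\alpha i} = -\frac{\partial E}{\partial v_{\alpha i}}. \qquad (10)$$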


Equations 9 and 10 can be updated synchronously, asynchronously or solved iteratively
by moving only a small distance from the old value of $u_{\alpha i}$ in the direction of the
new mean field:

$$u_{\alpha i}(t) = u_{\alpha i}(t-1) + \lambda \Big[ -\frac{\partial E}{\partial v_{\alpha i}} - u_{\alpha i}(t-1) \Big]. \qquad (11)$$

At high temperatures $T$, the system is in the trivial solution $v_{\alpha i} = 1/2\ \forall \alpha, i$ and
the activations of all neurons are in the linear region of the sigmoid function. The
system can be described by linearized equations. The magnitudes of all eigenvalues
of the corresponding transfer matrix are less than 1. At a critical temperature $T_c$ the
magnitude of at least one of the eigenvalues becomes greater than one and the trivial
solution becomes unstable. $T_c$ and favorable weights for the different terms in the
energy function can be found by an eigenvalue analysis of the linearized equations [5].
MFA1 is equivalent to the mean field theory of spin glasses [4].
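
The relaxation of Equations 9 to 11 can be sketched on a toy problem. Everything specific in the example is an assumption made for illustration: the energy is a hypothetical stand-in, $E = -\sum s_{\alpha i} v_{\alpha i} + (C/2) \sum_i (\sum_\alpha v_{\alpha i} - 1)^2$ with a quadratic column penalty rather than the terms $E_S$, $E_C$ and $E_B$ of the text, and the step size, penalty weight and cooling schedule are arbitrary choices, not the values used in the experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mfa1_relax(score, T, C=2.0, lam=0.2, steps=50, u=None):
    """Relax the mean fields u at fixed temperature T (Equations 9-11).

    Toy energy (an assumption for this sketch):
        E = -sum(score * v) + (C / 2) * sum_i (sum_a v[a][i] - 1)**2
    so that -dE/dv[a][i] = score[a][i] - C * (column_sum[i] - 1).
    """
    rows, cols = len(score), len(score[0])
    if u is None:
        u = [[0.0] * cols for _ in range(rows)]
    v = [[sigmoid(u[a][i] / T) for i in range(cols)] for a in range(rows)]
    for _ in range(steps):
        col_sum = [sum(v[a][i] for a in range(rows)) for i in range(cols)]
        for a in range(rows):
            for i in range(cols):
                field = score[a][i] - C * (col_sum[i] - 1.0)   # new mean field
                u[a][i] += lam * (field - u[a][i])             # Equation 11
        v = [[sigmoid(u[a][i] / T) for i in range(cols)] for a in range(rows)]
    return u, v

def anneal(score, T_start=1.0, T_end=0.05, cooling=0.8):
    """Lower the temperature geometrically, reusing u between temperatures."""
    u, v, T = None, None, T_start
    while T >= T_end:
        u, v = mfa1_relax(score, T, u=u)
        T *= cooling
    return v

toy_score = [[1.0, -0.5],
             [-0.5, 1.0]]
v_cold = anneal(toy_score)
_, v_hot = mfa1_relax(toy_score, T=1e6)
```

At very high temperature every mean value sits near 1/2, the trivial solution described above; after annealing, each column has one neuron close to one and the other close to zero.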


2.1.2 MFA2

It is also possible to obtain mean field equations which assure that at every temperature
$T$, the column constraint is met. One considers only states $S$ in which exactly one
neuron in every column is equal to one and all others are equal to zero, or where all
neurons in a column are equal to zero. Under the mean field assumption

$$v_{\alpha i} \approx \frac{1 \times e^{u_{\alpha i}/T}}{1 + \sum_\beta e^{u_{\beta i}/T}} \qquad (12)$$

with

$$u_{\alpha i} = -\frac{\partial E}{\partial v_{\alpha i}}. \qquad (13)$$

The column constraint term (Equation 3) drops out of the energy function. The high
temperature fixed point corresponds to $v_{\alpha i} = 1/(N+1)\ \forall \alpha, i$ where $N$ is the number
of rows. MFA2 is similar to the mean field theory of Potts glasses [5].
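
Equation 12 is a softmax over each column with one extra "no match" state contributing the 1 in the denominator. A minimal sketch follows; the function name `mfa2_means` and the shift used for numerical stability are additions for this example and are not part of the original formulation.

```python
import math

def mfa2_means(u, T):
    """Column-wise mean values v[a][i] = exp(u/T) / (1 + sum_b exp(u_b/T)).

    The 1 in the denominator is the weight of the state in which no neuron
    of the column is active. A shift keeps the exponentials from overflowing.
    """
    rows, cols = len(u), len(u[0])
    v = [[0.0] * cols for _ in range(rows)]
    for i in range(cols):
        shift = max(0.0, max(u[a][i] for a in range(rows)))
        weights = [math.exp((u[a][i] - shift) / T) for a in range(rows)]
        denom = math.exp(-shift / T) + sum(weights)
        for a in range(rows):
            v[a][i] = weights[a] / denom
    return v

# Three rows, one column.
v_uniform = mfa2_means([[0.0], [0.0], [0.0]], T=1.0)
v_winner = mfa2_means([[2.0], [0.5], [-1.0]], T=0.01)
v_nomatch = mfa2_means([[-1.0], [-2.0], [-3.0]], T=0.01)
```

With all fields equal to zero each of the three neurons takes the value 1/(N + 1) = 1/4, the high temperature fixed point quoted above. At low temperature the column either selects the neuron with the largest positive field or, if all fields are negative, switches every neuron off, so the column sum never exceeds one.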

3 Applications

3.1  2-D  Object  Recognition 

There are many vision applications in which one spatial dimension can be neglected
and the object recognition problem is essentially 2-D. Figures 4, 6 and 7 show typical
objects presented to the recognition system. The task was to recognize the objects
in the scenes independent of scale, translation, rotation and small distortions. The
primitives are line segments that approximate edges in the scene. As preprocessing
steps, the images are thresholded to separate objects from background. A convolution
with a Laplacian followed by another thresholding operation extracts the edges. Edges
are typically found at the outer contours of an object or at features inside an object,
typically at holes (inside contours). A contour is traced with a simple contour-following
scheme. The angle $\phi$ which the contour forms in relation to a horizontal line is plotted
versus the arclength $s$ of the boundary traversed. Corners of the object correspond
to maxima and minima of the smoothed first derivative $d\phi/ds$, and line segments are
formed by connecting two successive corners. Contours found within another outer
contour are considered inner contours of the same object (Figures 5, 6, 7).

A single line segment can be described by position, orientation and length. Since
none of these parameters is invariant under the transformations mentioned above, a
direct comparison between the parameters of the primitives is not feasible and $E_P = 0$.
The description of scene and models is encoded only in the relations between line
segments. Here, only relations of line segments within a local neighborhood are considered.
$\chi_{\alpha,\beta,i,j}$ is equal to zero unless both

• line segment $p_\alpha$ is attached to line segment $p_\beta$ and

• line segment $p_i$ is attached to line segment $p_j$.

Otherwise, $\chi_{\alpha,\beta,i,j} = \rho(|\phi_{\alpha\beta} - \phi_{ij}|/\sigma^\phi + |r_{\alpha\beta} - r_{ij}|/\sigma^r)$ where $\phi$ is the angle between
line segments and $r$ the logarithm of the ratio of their lengths (Figure 2).

The energy terms $E_S$, $E_C$ and $E_B$ are weighted by 1, 2, and 0.1, respectively, if
MFA1 was used. If MFA2 is employed, the energy function consists only of $E_S$,
weighted by 1.
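
The reason the relational parameters are suitable for invariant recognition can be checked directly: the angle between two attached segments and the logarithm of their length ratio do not change when both segments undergo the same scaling, rotation and translation. The sketch below is a minimal illustration; treating segments as directed point pairs and the helper names are assumptions of this example, not details taken from the system described.

```python
import math

def relation(seg_a, seg_b):
    """Return (phi, r): signed angle from seg_a to seg_b and log length ratio."""
    (ax0, ay0), (ax1, ay1) = seg_a
    (bx0, by0), (bx1, by1) = seg_b
    ax, ay = ax1 - ax0, ay1 - ay0
    bx, by = bx1 - bx0, by1 - by0
    phi = math.atan2(ax * by - ay * bx, ax * bx + ay * by)
    r = math.log(math.hypot(bx, by) / math.hypot(ax, ay))
    return phi, r

def similarity_transform(seg, scale, angle, dx, dy):
    """Apply scaling, rotation and translation to both endpoints of a segment."""
    c, s = math.cos(angle), math.sin(angle)
    return tuple((scale * (c * x - s * y) + dx, scale * (s * x + c * y) + dy)
                 for x, y in seg)

seg_a = ((0.0, 0.0), (2.0, 0.0))
seg_b = ((2.0, 0.0), (2.0, 1.0))     # attached to the end of seg_a
phi, r = relation(seg_a, seg_b)

moved_a = similarity_transform(seg_a, 3.0, 0.7, 5.0, -2.0)
moved_b = similarity_transform(seg_b, 3.0, 0.7, 5.0, -2.0)
phi_moved, r_moved = relation(moved_a, moved_b)
```

For the two segments shown, the angle is 90 degrees and the log ratio is log(1/2), both before and after the transformation.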

3.1.1  Experiments 

The model base consisted of 6 different industrial objects which were typically described
by 10 to 30 line segments each. The recognition was tested on scenes with a single object
in varying scale, position, illumination and orientation. If the illumination allowed a
clear separation between background and object, the preprocessing stage segmented
the pieces into line segments in the same way as the corresponding pieces in the model
base were segmented, with variations in the extracted parameters $\phi$ and $r$ depending
on precisely where corners were found. The recognition of the objects was always
successful and all line segments matched correctly within about 20 time steps. The
network successfully found the correct match even if the data base consisted of two
models where one model differed from the other in only one parameter.

In one scene, all six pieces were present. A total of 81 line segments were matched
correctly in about 20 time steps. When the illumination became less uniform, the
separation between background and object was not completely possible with simple
thresholding. If the segmentation of a contour of an object was correct, that is, the
same as in the model base, $\phi$ and $r$ varied by up to 15 degrees and 20% respectively, but
the line segments were matched correctly, demonstrating the distortion insensitivity of
the system. If portions of a contour were segmented incorrectly, the line segments
in that portion were not matched, but the line segments in the correctly segmented
portion of the contour were matched correctly. If the model base consisted of all 6
pieces, the line segments in the incorrectly segmented part of the contour were often
matched to line segments in the wrong model. Therefore, with large data bases, the
image quality should be sufficient to allow a correct segmentation.

The recognition was also tested on partially overlapping pieces. If a sufficient number of
line segments in the contour of each piece could be segmented correctly, these line
segments could be matched and object recognition was successful here as well (Figure 8).

3.2  3-D  Object  Recognition 

3.2.1  The  Correspondence  Problem 

As before, images are segmented into line segments. In the scene in Figure 9, these
lines correspond to the edges, structure and contours of the objects and shadow lines.
To solve the correspondence problem, corresponding lines in left and right images have
to be identified. A good assumption is that the object in one image is a distorted and
shifted version of the object in the other image with approximately the same scale and
orientation. Therefore, the lengths $l$ of line segments are compared, $\kappa_{\alpha i} = \rho(|l_\alpha - l_i|/\sigma^l)$,
and the angles $\phi$ and attachment points $q$ between adjacent line segments are compared,
$\chi_{\alpha,\beta,i,j} = \rho(|\phi_{\alpha\beta} - \phi_{ij}|/\sigma^\phi + |q_{\alpha\beta} - q_{ij}|/\sigma^q)$ (Figure 2).

Here, we have two uniqueness constraints: at most one neuron should be
active in each column and in each row. The row constraint is enforced by an energy term
equivalent to $E_C$:

$$E_R = \sum_\alpha \Big[\Big(\big(\sum_i m_{\alpha i}\big) - 1\Big)^2 \sum_i m_{\alpha i}\Big]. \qquad (14)$$

Figure 10 shows the line segments and the matrix of match neurons after 10 iterations.
All line segments that are present in both images could be matched. One of the
legs of the wardrobe was only segmented in the right image and has no correspondence
in the left image. Only MFA2 was employed for this problem. $E_S$ is weighted by 1,
$E_R$ by 10 and $E_P$ by 0.1.

3.2.2  Description  of  the  3-D  Object  Structure 

Next, a 3-D description of the visible portion of the object must be generated. As a
result of the last section, we know which endpoints in the left image correspond to
endpoints in the right image. In the experiments, the two cameras were mounted in
parallel. If $D$ is the separation of both cameras, $f$ the focal length of the cameras, and
$x_r, y_r, x_l, y_l$ the coordinates of a particular point in the right and left images, the 3-D
position of the point in camera coordinates $x, y, z$ becomes

$$z = \frac{D f}{x_r - x_l} \qquad (15)$$

$$y = \frac{z\, y_r}{f} \qquad (16)$$

$$x = \frac{z\, x_r}{f} + \frac{D}{2} \qquad (17)$$

Figure 4: Typical scenes.

Figure 7: Overlapping workpieces.

Figure 8: Network converges to solutions. Fat line segments are matched correctly.

Figure 9: Stereo images of a scene.

Figure 10: Stereo matching network.

Knowing  the  true  3-D  position  of  the  endpoints  of  the  line  segments,  the  system 
concludes  that  the  chair  and  the  wardrobe  are  two  distinct  and  spatially  separated 
objects  and  that  line  segments  12  and  13  in  the  right  image  and  12  in  the  left  image 
are  not  connected  to  either  the  chair  or  the  wardrobe.  On  the  other  hand,  it  is  not 
obvious  that  the  shadow  lines  under  the  wardrobe  are  not  part  of  the  wardrobe. 


3.2.3  Matching  Objects  and  Models 

The scene description now must be matched with stored models describing the complete
3-D structures of the models in the data base. The model description might be
constructed by either explicitly measuring the dimensions of the models or by using
several stereo views of the models.

Here, $\kappa$ and $\chi$ are the same as in the correspondence problem. Note that here
$\phi$ is not a very discriminating parameter since $\phi$ is 90 degrees for many pairs of line
segments.

The knowledge about the 3-D structure allows a segmentation of the scene into
different objects, and the row constraint is only applied to neurons relating to the same
object $O$ in the scene: $E_{R'} = \sum_O \sum_\alpha \big[\big((\sum_{i \in O} m_{\alpha i}) - 1\big)^2 \sum_{i \in O} m_{\alpha i}\big]$.

Here, $E_S$ is weighted by 1, $E_{R'}$ by 20 and $E_P$ by 1. Figure 11 shows the network
after 20 iterations. Except for the occluded leg, all line segments belonging to the chair
could be matched correctly. All non-occluded line segments of the wardrobe could be
matched correctly except for its left front leg.


4  2-D  and  3-D  Position 

4.1  3-D 

A translation followed by a rotation can be described by

$$\begin{pmatrix} x_s \\ y_s \\ z_s \\ 1 \end{pmatrix} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & d_x \\ r_{21} & r_{22} & r_{23} & d_y \\ r_{31} & r_{32} & r_{33} & d_z \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_0 \\ y_0 \\ z_0 \\ 1 \end{pmatrix} \qquad (18)$$

or in matrix notation $X_s = R X_0$. Equation 18 has to hold for all pairs of points
$X_0 = (x_0, y_0, z_0)$ in standard position and $X_s = (x_s, y_s, z_s)$ in scene coordinates.

The optimal solution (in the least squares sense) for $R$ is

$$R_{opt} = X_s X_0^{+} \qquad (19)$$

where $X_0^{+}$ is the pseudo inverse of $X_0$. The coefficients of $R_{opt}$ can also be obtained
using three ADALINEs as shown in Figure 12.

If the first rotation is about the x-axis by $\gamma$, followed by a rotation about the y-axis
by $\beta$, followed by a rotation about the z-axis by $\alpha$, then

$$\beta = \arctan \frac{-r_{31}}{\sqrt{r_{11}^2 + r_{21}^2}} \qquad (20)$$

$$\alpha = \arctan \frac{r_{21}}{r_{11}} \qquad (21)$$

$$\gamma = \arctan \frac{r_{32}}{r_{33}} \qquad (22)$$

For the chair in Figure 9 one obtains $d_x = -3.8$ cm, $d_y = 4.8$ cm, $d_z = 62.9$ cm,
$\gamma = -12$ degrees, $\beta = -36$ degrees and $\alpha = 15$ degrees (in camera coordinates).
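
Equations 20 to 22 can be checked by composing the three rotations and recovering the angles from the resulting matrix. The sketch below does this round trip for the angles reported for the chair. It uses `atan2` instead of a plain arctangent, a substitution made for this example; the two agree whenever $\cos\beta > 0$ and the angles lie within plus or minus 90 degrees. The function names are hypothetical.

```python
import math

def rotation_zyx(alpha, beta, gamma):
    """R = Rz(alpha) * Ry(beta) * Rx(gamma): rotate about x first, then y, then z."""
    ca, sa = math.cos(alpha), math.sin(alpha)
    cb, sb = math.cos(beta), math.sin(beta)
    cg, sg = math.cos(gamma), math.sin(gamma)
    return [
        [ca * cb, ca * sb * sg - sa * cg, ca * sb * cg + sa * sg],
        [sa * cb, sa * sb * sg + ca * cg, sa * sb * cg - ca * sg],
        [-sb,     cb * sg,                cb * cg],
    ]

def angles_from_rotation(R):
    """Recover (alpha, beta, gamma) following Equations 20-22."""
    beta = math.atan2(-R[2][0], math.hypot(R[0][0], R[1][0]))
    alpha = math.atan2(R[1][0], R[0][0])
    gamma = math.atan2(R[2][1], R[2][2])
    return alpha, beta, gamma

# Angles reported for the chair: alpha = 15, beta = -36, gamma = -12 degrees.
expected = tuple(math.radians(d) for d in (15.0, -36.0, -12.0))
recovered = angles_from_rotation(rotation_zyx(*expected))
```

The recovered angles agree with the inputs to within floating-point precision. The extraction becomes ill-conditioned when $\beta$ approaches plus or minus 90 degrees, because $r_{11}$, $r_{21}$, $r_{32}$ and $r_{33}$ all tend to zero there.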

4.2  2-D 

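Scaling, translation and rotation can be described by

$$\begin{pmatrix} x_s \\ y_s \\ 1 \end{pmatrix} = \begin{pmatrix} \lambda \cos\alpha & -\lambda \sin\alpha & d_x \\ \lambda \sin\alpha & \lambda \cos\alpha & d_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x_0 \\ y_0 \\ 1 \end{pmatrix} \qquad (23)$$

or in matrix notation

$$X_s = R X_0 \qquad (24)$$

The coefficients of $R$ can again be derived using the pseudo inverse or an ADALINE.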
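
For the 2-D case, the scale, rotation angle and translation can also be estimated in closed form. The sketch below is not the pseudo-inverse or ADALINE procedure named in the text: it is a least-squares fit constrained to the four parameters of the scaling, rotation and translation model, written with complex arithmetic, where each point $(x, y)$ becomes $x + iy$ and the transformation becomes $z_s = a z_0 + d$ with $a = \lambda e^{i\alpha}$. For noise-free correspondences both approaches should recover the same parameters. The function name and test points are hypothetical.

```python
import cmath

def fit_similarity_2d(model_pts, scene_pts):
    """Least-squares estimate of (scale, angle, dx, dy) from matched 2-D points."""
    z0 = [complex(x, y) for x, y in model_pts]
    zs = [complex(x, y) for x, y in scene_pts]
    n = len(z0)
    m0 = sum(z0) / n                      # centroid of the model points
    ms = sum(zs) / n                      # centroid of the scene points
    num = sum((b - ms) * (a - m0).conjugate() for a, b in zip(z0, zs))
    den = sum(abs(a - m0) ** 2 for a in z0)
    a = num / den                         # a = scale * exp(i * angle)
    d = ms - a * m0                       # translation
    return abs(a), cmath.phase(a), d.real, d.imag

model_pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (-1.0, 1.0)]
true_a = cmath.rect(1.5, 0.4)             # scale 1.5, rotation 0.4 rad
true_d = complex(2.0, -1.0)
scene_pts = []
for x, y in model_pts:
    z = true_a * complex(x, y) + true_d
    scene_pts.append((z.real, z.imag))

scale, angle, dx, dy = fit_similarity_2d(model_pts, scene_pts)
```

The constrained fit needs at least two distinct matched points; with noisy matches it differs from the unconstrained pseudo-inverse solution, which is free to return a matrix that is not a pure scaled rotation.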


5  Convergence  and  Comparison  of  the  Two  Mean  Field 
Approaches 

In the 2-D problem, both mean field approaches MFA1 and MFA2 were used; in the
3-D problem, only MFA2 was used. In general, both mean field approaches converged
to the same solutions.

If MFA1 was used, the network was updated using Equations 9 and 11. The
experiments suggest that there are two critical temperatures: the first when the system
leaves the trivial solution $v_{\alpha i} = 1/2\ \forall \alpha, i$ and starts to enforce the column constraint,
and the second when it departs from a solution where all neurons have similar and
small activations to the binary solution with only one neuron active per column.

$E_C$ is a penalty term that is introduced to implement a constraint that should
be strictly enforced. Unfortunately, a large weight $C_C$ for $E_C$ leads to an
ill-conditioned Hessian matrix. In practice this means that $E_C$ describes a narrow ravine
and the system zig-zags up and down the slopes without making much progress in the
perpendicular direction. Also, small time steps have to be taken to avoid oscillations
and instabilities. For these reasons, in some experiments we started with a small $C_C$
and increased it gradually over time. In the experiments, $C_C$ increases exponentially and
the step size decreases exponentially. The initial value of $C_C$ is scaled by 1/c. Through
this approach convergence speed could be increased. The network started out slightly
above the second critical temperature and annealed from $T = 0.8$ to $T = 0.4$.

Figure 12: ADALINE for calculation of R.

When using MFA2, update Equations 12 and 11 were used. Since the energy surface
is more favorable, larger time steps can be taken and the net converged much faster
(5 to 10 times) to a solution without getting into oscillatory states. The annealing
started at $T = 0.05$ and was stopped at $T = 0.03$.

6  Discussion 

The  experiments  showed  that  the  system  recognizes  objects  robustly  and  reliably.  The 
system  relies  on  the  correct  identification  of  line  segments  and  their  relations  in  the 
scene  in  the  preprocessing  stage.  More  elaborate  approaches  must  be  used  if  the  scenes 
become  more  complex  and  edges  more  ambiguous.  Edge  detection  and  reliable  contour 
following  can  be  increasingly  difficult. 

In the 3-D problem, a hierarchical system can be considered. In the first step,
simple objects such as squares, rectangles, and circles are identified, and these form
the primitives in a second stage to recognize complete objects. It is also possible to
combine these two matching nets into one hierarchical net as described in [3].

An objective function for model-based object recognition is formulated and used
to specify a neural network whose dynamics carry out the optimization, and hence the
recognition task. Models are specified as graphs that capture structural properties
of shapes to be recognized. In addition, compositional (INA) and specialization (ISA)
hierarchies are imposed on the models as an aid to indexing and are represented in the
objective function as sparse matrices. Data are also represented as a graph. The
optimization is a graph-matching procedure whose dynamical variables are “neurons”
hypothesizing matches between data and model nodes. The dynamics are specified
as a third-order Hopfield-style network augmented by hard constraints implemented
by “Lagrange multiplier” neurons. Experimental results are shown for recognition in
Stickville, a domain of 2-D stick figures. For small databases, the network successfully
recognizes both an object and its specialization. Learning of recognition parameters
is implemented through the use of a modified back propagation network.

1 Vision

1.1  Goals 

The process of vision may be formulated as an information-processing task. The visual
processing apparatus may be relegated to an isolated black box, with a defined input,
in this case a visual scene, and a desired output, the information which is contained
within the scene. To formulate the problem in this manner, however, immediately
raises fundamental questions about vision. What is the nature of the visual scene
used as input? Is it a simple intensity display, or does it contain information
concerning depth, texture, and motion? What is the desired output? Is it simply a
catalog of all objects recognized within the scene and their positions? This has been
a goal of industrial computer vision. Or is the goal to obtain some kind of
understanding as to the relationship of objects within the scene: an understanding of
the image rather than a processing of the image?

Goals may also be viewed as a hierarchy which emphasizes different aspects of the
visual process. Object avoidance is important, but may not lead to an understanding
of the visual scene. More complex goals may be to extract edge information,
characterize objects in the image, or understand their relationship. Additionally,
goals at the more complex level may be created by the visual information presented.

A second consideration in vision is the hardware that is used to implement the
visual algorithms within the black box. The nature of an information-processing task
suggests, but does not require, particular constraints on the equipment used to process
the information. Parallel hardware and algorithms are well suited to processing the
vast amounts of information in the initial processes of vision. These constraints direct
the search for algorithms used to process the information in the task. Therefore, to
understand and characterize any information-processing task, one must study the task
as it is accomplished by the particular configuration of equipment used to implement
it.


1.2 Psychophysical and Neurophysiological studies

One approach to studying visual systems is to disregard what is known about human
vision. This approach may indeed have advantages, such as ease of implementation,
particularly in industrial applications. However, what is known about the process of
visual understanding comes from human vision. It stands as proof that visual
understanding is both possible and practical. It provides both a starting point and a
working example.

Drawing on previous psychophysical studies of vision, various investigators have
proposed algorithms to model limited aspects of perception. Some of the psychophysical
studies investigated the binocular nature of vision. Bela Julesz devised random
dot stereograms which created an illusion of depth [26]. Such investigations
show that depth perception need not depend on information such as the
identity of more complex objects, but rather can proceed on pattern matching
alone.

Roger  Shepard  and  Jacqueline  Metzler  investigated  the  psychophysics  of  pattern 
recognition  [27].  In  their  experiment,  they  presented  subjects  with  groups  of  line 
drawings  of  simple  block  objects.  The  presented  objects  differed  by  rotation  and/or 
mirror  inversion.  They  found  that  recognition  time  varied  directly  with  the  amount 
of  rotation  difference  between  the  images. 

These psychophysical studies provide information about isolated modules that the
human visual apparatus uses to identify information contained within scenes. They
direct the search for algorithms toward ones that react in the same manner as observed
experimentally.

A second approach taken was that of neurophysiological studies of retinal cells
[28]. Barlow studied the ganglion cells of the frog retina. Such ganglion cells were
monitored as their receptive field was stimulated, noting which stimuli activated
the cell. Such cells, when stimulated by an appropriate visual stimulus, often invoked
a feeding response in the frog; the frog would jump toward the stimulus and snap its
mouth. This response is important, for it shows that, at least in the frog, a large
degree of information processing has occurred at the level of the retina, and at the
level of a single neuron. These observations led Barlow to state:

The  cumulative  effect  of  all  the  changes  I  have  tried  to  outline  above  has 
been  to  make  us  realize  that  each  single  neuron  can  perform  a  much  more 
complex  and  subtle  task  than  had  previously  been  thought.  Neurons  do  not 
loosely  and  unreliably  remap  the  luminous  intensities  of  the  visual  image 
onto  our  sensorium,  but  instead  they  detect  pattern  elements,  discriminate 
the  depth  of  objects,  ignore  irrelevant  causes  of  variation  and  are  arranged 
in  an  intriguing  hierarchy. 

This led to what is perhaps an overstatement by Barlow:

A description of that activity of a single nerve cell which is transmitted
to and influences other nerve cells and of a nerve cell's response to such
influences from other cells, is a complete enough description for functional
understanding of the nervous system. There is nothing else “looking at”
or controlling this activity, which must therefore provide a basis for
understanding how the brain controls behaviour.

The ability to monitor the signal of individual neurons led investigators to trace
the behavior of neurons at successively deeper neural levels. Hubel and Wiesel studied
the nature of neurons' receptive fields at the level of the visual cortex of the cat,
finding organization into functional columns [35].

These statements directed neurophysiological studies for the following decades.
Investigators searched for cells containing ‘units’ of information. Gross, Rocha-Miranda,
and Bender found cells in the inferotemporal cortex which were activated when a hand
was present in the visual field [29]. They were a stepping stone on the way to finding
the elusive grandmother cell, a cell that is activated when one’s grandmother comes
into view.

The psychophysical and neurophysiological studies were useful in describing the
observed behavior at a cell or an individual level, but were only that. They did not lead
to an understanding or an explanation of the process of vision.

1.3 Artificial visual systems

Artificial visual systems have simplified the problem by dividing the visual process
into low-level and high-level vision. The term low-level vision is used to describe the
extraction of basic information from the visual scene. It is the ability to characterize
luminosity, color, texture, and depth. It makes generic assumptions about the
information being processed, and is iconic in nature. One specific example is
extracting shape information based on the shading of objects within the scene. This
information provides an intermediate representation of the visual scene suitable for
high-level visual processes to work on.

One distinguishing feature of high-level vision is that it emphasizes contextual
knowledge, i.e. matching to expected models, as opposed to low-level vision, whose
processing is iconic, uniform and simple.

Artificial visual systems have been created to implement selected aspects of the
visual process. Several different approaches have been taken to achieve this goal. The
first is to try to improve the low-level visual information being extracted from the raw
image. This approach, for example, tries to improve the performance of edge-detector
operators that are applied to the raw image. This approach leads to the generation
of sets of lines or other image information which are used to describe the image.

The next step is to use this information to generate information about the actual
content of the scene. If the initial scene is limited in scope, such as a collection
of toy blocks, then there are defined algorithms to generate one from the other.

An approach by Waltz achieves this by cataloging all possible vertices contained
within the image [32]. By examining the type of vertex, and by grouping sets of
vertices together according to the edges which interconnect them, it is possible to fit
an object or limited set of objects to the edge information.

Several groups have extended this type of “catalog” approach to images which
contain non-convex polyhedra [31]. Waltz has also extended this approach by the
addition of shadows [32]. These programs are able to examine groups of possible
vertices, eliminate those which are spurious (due to shadows or object overlap) and
label the remaining vertices, objects, and planes. They do so at the expense of a
combinatorially expanding catalog of possible edge and vertex tables. The idea of
using a stored model of a shape to assist the recognition process was introduced by
Roberts [30], in a computer program that produced edge descriptions of images built
with cubes and wedges.

Such  approaches  rapidly  break  down,  however,  for  more  complex,  realistic  imagery. 
The  approach  is  also  inadequate  when  image  information  is  incomplete,  or  possibly 
inconsistent. 

1.4  Computational  Theory 

David Marr has formalized the vision problem by casting it as an information-
processing task [18]. In doing so, he separated the problem into three distinct levels
which, while independent, interact. The separate levels are a) computational theory,
b) representation and algorithm, and c) hardware implementation.

The first, computational theory, concerns the abstract goals of the process to be
achieved. It formally states the desired transformation to be achieved. With regard
to stereopsis, the goal is to extract the notion of depth from the scene. The details
of how the extraction is to occur are not specified at this level.

The second level concerns representations and algorithms. The same information
may be represented in many different forms. This representation may be tailored to
suit the desired transformation, and may vary at different steps in the processing task.
Some examples of the representation of visual information are gray-level intensity
images and semantic networks of related objects within the scene.

The second level also concerns the algorithms used to implement the desired
transformation specified in the first level, and how these algorithms interact to share
information. The algorithms and the representation of the data are intimately coupled.
The algorithm for longhand multiplication is straightforward when the numbers are
represented in a base ten system, but quite difficult when the data is represented in
Roman numerals. The algorithms in a visual system are the programs which attempt
to extract information, such as edges, from the visual data.

The  third  level  concerns  the  details  of  the  hardware  upon  which  the  algorithms  are 
to  be  implemented.  The  hardware  may  range  from  electrical  transistors,  connected 
in  serial  or  parallel  configurations,  to  the  biological  neurons. 

The levels are loosely linked. The set of algorithms available is, to an extent,
constrained by the nature of the hardware available. Similarly, there is ambiguity
as to which level a task must be assigned. At what level should neuropsychological
phenomena be assigned: at the detector (hardware) level, at an algorithm level, or
do they arise from an interaction between levels? Additionally, the level at which
various tasks are processed may shift with the changing nature of goals, algorithms,
and hardware.

Within this multi-level context, it is possible to roughly separate the contributions
of the various fields. Psychophysics may be loosely linked to the level of information
representation and the algorithms applied to it. Different visual algorithms fail in
markedly different manners when pushed to extremes. These failures observed in
psychophysics may help guide the search at the algorithm level. Neuroanatomy more
closely applies to the hardware that implements the algorithms and stores the
representations.

This  approach  allows  one  to  analyze  why  various  visual  systems  fail.  Systems  may 
fail  because  their  goal  was  limited  (level  1),  the  algorithms  or  data  representations 
are  not  flexible  enough  to  allow  the  goal  (level  2),  or  the  hardware  is  not  appropriate 
to  the  task  (level  3). 

2  Recognition 

One of the processes that a visual system must achieve is the recognition of objects
observed in the visual field. The recognition may be described as occurring at several
levels, starting from information provided by the low-level visual processes. Object
recognition is a process of constructing an abstract description of the data, so that a
small amount of information (e.g. a “car”) can effectively summarize a large amount
of data (e.g. millions of bits comprising the image of a car) for purposes of high-level

One method of accomplishing recognition is to decompose an object into its parts,
and solve the problem by recognizing the parts. The matching is relational in that
recognition is contextually dependent on the disposition of other parts. In fact, the
grouping of image data into parts and the matching of these parts must usually
proceed simultaneously and interact with each other. Such recognition processes must
function when limited information is available, such as when shadows or occlusion
obscure parts of the image.

2.1  Matching 

The process of recognition consists of matching the observed part within the image
to a stored database of known objects. Such internal descriptions of objects may
concern what is thought to be important information about the object, such as size,
color, texture, shape, or other available information.

What  remains  is  to  match  the  observed  object  to  the  list  of  internal  descriptions 
available.  Problems  arise,  however,  in  that  several  objects  may  partially  fit,  creating 
confusion.  One  approach  often  taken  is  to  limit  the  input  objects  or  the  description 
parameters  so  as  not  to  create  the  confusion  of  overlap. 

Another possible approach is to extend the scope of the matching process of
an individual object to encompass the matching of other objects within the scene.
This is often possible since other objects within the scene contain information which
may bear upon the recognition of the original object (a head may help to recognize
a hand instead of a paw). In doing so, an additional level of matching is uncovered,
that of matching groups of internal descriptions to possible groups of objects within
the scene.

A limit is placed on the combinatoric nature of the memory by separating objects
into their constituent parts. Multiple objects may be created from a limited number
of constituent parts, and recognition then becomes a two-level process. The first level
is recognizing the limited number of parts, and the second is matching a group of the
parts to groups of parts in the image memory. These two levels may interact, each
helping the matching process of the other.

2.1.1  Hierarchy 

Models may be organized into hierarchies in order that large numbers of objects may
be stored in the database. Two types that are useful are compositional hierarchies
and specialization hierarchies. Compositional (“part-of”) hierarchies impose specific
top-down restrictions on the combinations of low-level elements which need to be
considered, thus reducing search cost. Specialization (“type-of”) hierarchies reduce
work by allowing incremental recognition. To do recognition, the vision system must
dynamically find groups of parts in the image whose internal relations satisfy the
definitions imposed by the stored models in the parts hierarchy.

2.1.2 Invariance

A goal of a recognition system applied to a nontrivial domain is that it must recognize
patterns and objects regardless of simple transformations such as translation,
rotation, scaling, and perspective. The recognition should be able to operate with partial
information of the object, and should be able to recognize a large number of objects.
It is desirable to avoid a visual memory which achieves this latitude by separately
storing every conceivable view of every object to be recognized.

2.1.3 Invariant parameter matching

Matching  may  be  approached  through  parameter  matching.  Invariant  aspects  of  the 
objects  or  groups  of  objects  may  be  used  to  construct  a  list  of  parameters  describing 
each  object  or  collection  of  objects.  Matching  then  becomes  a  table  lookup  problem 
to  recall  the  label  associated  with  the  set  of  parameters.  Such  approaches  must  decide 
what  are  the  invariant  aspects  of  objects  which  are  used  to  generate  the  parameters. 

2.1.4  Parameter  optimization  matching 

The internal images may be stored not as objects but as mathematical functions
describing the image, with a set of tunable parameters. Matching may then proceed
by fitting the parameters to the observed objects. This can be used when there is a
known object to be found, though it may vary from the internal object due to size
or rotation. These techniques work well when the input data objects belong to a
limited set of known objects, and the degrees of variation are limited. This technique
has been employed by Marr and Nishihara [33] in matching articulated collections of
cylinders to observed data.

2.1.5  Graph  theory 

Matching problems may also be formulated in terms of graph matching. The internal
description of a group of objects may be viewed as a relational graph structure
containing the objects as vertices which are connected with arcs, describing some
relationship between the parts. The arc may be used to describe a physical connection,
such as the object arm is connected to the object body. Such structures are termed
graphs.

If the object contained within the scene is also described in terms of a graph,
then the recognition process at the top level consists of matching the input graph to
a set of stored graphs. The matching used is broken down in the graph matching
field into that of a) graph isomorphism: matching graph A to graph B with a
one-to-one relationship, b) subgraph isomorphism: matching graph A to a subgraph of
B, and c) double subgraph isomorphism: matching subgraphs of A to subgraphs of
B. Again, the problems are complicated by the possibility of missing or extra parts,
and the addition of the notion of ‘goodness of fit’ parameters which may regulate the
matching of nodes of the graphs.
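
As an illustration of case b), the following is a minimal brute-force sketch. It is
not the method used in this work: the function name and the edge-list representation
are assumptions made for the example, and the test used here only requires every edge
of A to map onto an edge of B (extra edges in B are allowed).

```python
from itertools import permutations


def find_subgraph_match(edges_a, n_a, edges_b, n_b):
    """Search for an injective map from the nodes of A into the nodes of B.

    Nodes are numbered 0..n-1 and edges are undirected pairs. Returns a
    dict {node_of_A: node_of_B} for the first map under which every edge
    of A lands on an edge of B, or None if no such map exists.
    """
    edge_set_b = {frozenset(edge) for edge in edges_b}
    # Try every ordered choice of n_a distinct nodes of B as the image of A.
    for image in permutations(range(n_b), n_a):
        if all(frozenset((image[u], image[v])) in edge_set_b for u, v in edges_a):
            return dict(enumerate(image))
    return None


if __name__ == "__main__":
    triangle = [(0, 1), (1, 2), (0, 2)]
    square_with_diagonal = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
    path = [(0, 1), (1, 2), (2, 3)]
    print(find_subgraph_match(triangle, 3, square_with_diagonal, 4))
    print(find_subgraph_match(triangle, 3, path, 4))
```

When both graphs have the same number of nodes and edges, the same search decides
case a). The number of candidate maps grows factorially with the size of the graphs,
which is why exhaustive search is impractical beyond small examples.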

One aspect of graph matching which is not immediately apparent is the computational
complexity of the algorithms. Most graph matching problems are NP-complete: all
known algorithms require computation that is, in the worst case, exponential in the
size of the graphs to be matched. This exponential nature of recognition is not
observed in visual psychophysical studies. Indeed, scenes with detailed information
are often recognized faster than scenes with less detailed information.

3  Artificial  Neural  Networks 

Artificial neural networks are collections of interconnected computational units,
termed “neurons.” The computational units perform a specified transformation
of their input(s) to their output. Input values may also have “weighting factors”
associated with them. The derived output then becomes the input for some specified
set of other neurons. Styles of networks vary according to the actual transformation
performed by the neurons (binary, linear, sigmoidal) and the nature of the
interconnections (layered, symmetrical, complete). One model we have used is the
Hopfield-style analog neural network [2]. This model uses a sigmoidal transformation
function and has symmetrical interconnections. Starting values and interconnection
weights are set, then the neural network is allowed to “relax,” seeking a stable state.
The output is the set of final resting output values of the neurons.
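
The relaxation process can be sketched as follows. This is a minimal illustration
under stated assumptions, not the network used in this work: it integrates the
commonly used analog dynamics du_i/dt = -u_i + sum_j W_ij v_j + b_i with
v_i = sigmoid(gain * u_i) by simple Euler steps. The function names, gain, step size,
and the two-neuron example weights are all chosen for the example.

```python
import math


def sigmoid(x, gain=1.0):
    """Sigmoidal transformation mapping any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-gain * x))


def relax(weights, bias, u0, dt=0.05, n_steps=2000, gain=4.0, tol=1e-8):
    """Let an analog network with symmetric weights settle to a stable state.

    weights: square symmetric list of lists, bias: list, u0: starting
    internal values. Returns the final outputs v_i of the neurons.
    """
    u = list(u0)
    n = len(u)
    for _ in range(n_steps):
        v = [sigmoid(x, gain) for x in u]
        du = [
            -u[i] + sum(weights[i][j] * v[j] for j in range(n)) + bias[i]
            for i in range(n)
        ]
        u = [u[i] + dt * du[i] for i in range(n)]
        if max(abs(d) for d in du) < tol:  # network has stopped changing
            break
    return [sigmoid(x, gain) for x in u]


if __name__ == "__main__":
    # Two mutually inhibiting neurons: the one with the head start wins.
    w = [[0.0, -2.0], [-2.0, 0.0]]
    b = [1.0, 1.0]
    print(relax(w, b, u0=[0.1, -0.1]))
```

In the two-neuron example the mutual inhibition makes the symmetric state unstable,
so the network settles with the first neuron near 1 and the second near 0.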

We have chosen to implement the recognition algorithms in the form of
optimization of objective functions implemented in neural networks. Objective functions,
which specify how graph matching is to occur, are attractive ways to formulate
visual algorithms. Such formulations are concise, being expressible in several lines of
algebra. Notions of hierarchy are easily incorporated in objective functions.

An additional advantage of objective functions is that they may be emulated
in function by a Hopfield-style analog neural network [2]. Solving the objective
functions becomes a problem in optimization theory: finding a solution consists of
finding minima of the objective function. Analog neural networks may be designed to
achieve this objective, with the additional advantages of tolerance to errors in both
the hardware and the data, and of the learning abilities of neural networks.

3.1  Learning 

Learning may be viewed as the process of modifying a pre-existing response so that it
better fits a desired response to a stimulus. Learning may occur when the stimulus and
the desired response are known. It then becomes a matter of adjusting, via a learning
algorithm, the internal parameters which control the response. Generalization of the
learned response occurs when the modified algorithm is applied to new input data.

Separate from the process of recognition is the process of learning. Learning in
vision systems may consist of adding new objects to the stored image database, or
modifying the parameters of already stored images. Learning may also consist of
extracting abstractions from groups of items. The nature of neural networks allows
learning to occur, although the process of learning a database is more complex than
applying that information. Specifically, we concentrate on learning to discriminate
match metrics that characterize individual models, fine-tuning the image recognition
process.

Experiments  by  Hurlbert  and  Poggio  [21]  have  shown  that  neural  networks  are 
capable  of  learning  a  color  algorithm.  Using  optimal  linear  estimation  to  synthesize 
the  algorithm  and  a  large  paired  set  of  input  and  desired  output  images,  they  were 
able  to  train  a  neural  network  to  perform  a  “lightness  algorithm.”  Interestingly,  the 
learned  algorithm  resembled  a  “lightness  algorithm”  previously  proposed  by  Land  [34]. 
One  limitation  of  the  learning  algorithm  employed  is  that  the  derived  algorithms  are 
strictly  linear  in  nature,  which  may  not  be  a  valid  assumption  of  the  desired  algorithm. 

Another approach to learning in neural networks is the back-propagation learning
algorithm. The neural network model consists of an input layer, an output layer, and
possibly intervening layers. Training vectors consist of an input pattern and a desired
output pattern. For each training vector, an error signal is generated between the
desired output and the actual output, and this signal is propagated backward from
output to input, adjusting the connection weights so as to reduce the error. One
advantage of this algorithm is that it operates on neurons with nonlinear output
functions, thereby allowing the learning of non-linear algorithms.
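
The procedure can be sketched for a small network with one hidden layer and a single
output. This is a generic textbook illustration, not the modified back-propagation
network used in this work. The squared-error loss, the sigmoid units, the function
names, and the weight layout (the last entry of each weight list is a bias) are
assumptions made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Return the input with bias, hidden outputs with bias, and the output."""
    xb = list(x) + [1.0]
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_hidden]
    hb = hidden + [1.0]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hb)))
    return xb, hb, out


def backprop(w_hidden, w_out, x, target):
    """Return the loss 0.5*(out - target)**2 and its gradients w.r.t. all weights."""
    xb, hb, out = forward(w_hidden, w_out, x)
    loss = 0.5 * (out - target) ** 2
    # Error signal at the output unit (derivative of the loss w.r.t. its net input).
    delta_out = (out - target) * out * (1.0 - out)
    grad_out = [delta_out * h for h in hb]
    # Propagate the error signal back to each hidden unit.
    grad_hidden = []
    for j in range(len(w_hidden)):
        delta_j = delta_out * w_out[j] * hb[j] * (1.0 - hb[j])
        grad_hidden.append([delta_j * xi for xi in xb])
    return loss, grad_hidden, grad_out


def train_step(w_hidden, w_out, x, target, rate=0.5):
    """Apply one gradient-descent update and return the new weights."""
    _, g_hidden, g_out = backprop(w_hidden, w_out, x, target)
    new_hidden = [[w - rate * g for w, g in zip(row, g_row)]
                  for row, g_row in zip(w_hidden, g_hidden)]
    new_out = [w - rate * g for w, g in zip(w_out, g_out)]
    return new_hidden, new_out
```

Repeating `train_step` over all training vectors gives the usual training loop.
Because each unit's derivative is available in closed form, the nonlinear sigmoid
units pose no difficulty for the weight update.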

The back-propagation technique has been applied to the task of deriving a
shape-from-shading visual algorithm. Curvature is a basic property of objects within a
visual scene. Shading of objects depends on multiple factors, such as illumination,
orientation, and curvature. Various analytical processes exist to obtain the curvature
of an object in an image from the observed shading. Sejnowski and Lehky [23] have
shown that it is possible to train a neural network to extract curvature information
from shading information. They used a three-layer neural network and applied the
back-propagation technique using a training set of 2000 images of parabolic surfaces.
One interesting aspect of this experiment was that the “neurons” of the hidden middle
layer of the neural network specialized in their functions, with regard to orientation of
curvature and the sign of the curvature.

3.2  Combinatorial  optimization 

Many of the algorithms used in recognition are combinatorial in nature. The
computational time needed to solve the problem increases exponentially with the size
of the problem to be solved, and this growth must be overcome in order to construct
practical visual recognition systems.

A highly abstracted version of graph matching has been used by Hopfield and
Tank to solve the travelling salesman problem (TSP) [3]. Simply stated, the goal is
to find the shortest path connecting a set of points, with the constraint that every
point be visited only once. Such a problem is easily restated as an objective function,
and the problem becomes one of minimizing, or optimizing, the objective function.
They implemented the objective function with analog-style neural networks. Such
networks, while not usually finding the best answer, quickly find quite good solutions
to the problem.

Shown in Figure 1 are two graphs. We may represent a graph by a sparse binary
connection matrix whose $ij$'th element is unity if node $i$ is connected to node $j$,
and is zero otherwise. The two graphs shown below are represented by symmetric
connection matrices $G_{\alpha\beta}$ and $g_{ij}$. A matching matrix $M_{\alpha i}$,
where $0 \le M_{\alpha i} \le 1$, of dynamic variables, which we call “match neurons”,
represents the correspondence between nodes $\alpha$ and $i$ of the two graphs to be
matched. The $M_{\alpha i}$ are elements that make decisions. Their activation
determines whether $\alpha$ is matched to $i$ ($M_{\alpha i}$ activated) or not matched
($M_{\alpha i}$ not activated). $M_{\alpha i}$ neurons are analog, and may
assume any value between 0 and 1. These intermediate values, obtained when the
neural network has not reached a stable state, are the strength of the hypothesis that
$\alpha$ and $i$ are matched.


Figure 1: An illustration of a graph matching problem


Parts being matched form local consistency rectangles [7]. Such a rectangle consists
of a pair of $G$-linked model nodes connected by their respective match neurons
$M$ to a pair of $g$-linked data nodes. An example of such a rectangle is illustrated
in Figure 1, consisting of the series $G_{52}$, $M_{22}$, $g_{21}$, $M_{51}$. A consistency rectangle is a
structure that supports matching based on contextual evidence. For example, the
match $M_{51}$ between node 5 in $G$ and node 1 in $g$ is enhanced because both nodes
have similar neighborhoods (nodes 2 in $G$ and 2 in $g$) and the neighbors themselves
match ($M_{22} = 1$). A simple objective (or energy) function maximizes the number of
consistency rectangles by rewarding the activation of such collections:

$$E(M) = -\sum_{\alpha\beta} \sum_{ij} G_{\alpha\beta}\, g_{ij}\, M_{\alpha i}\, M_{\beta j}$$

Versions of this objective function for graph matching have been proposed by Hopfield
[15] and von der Malsburg [16]. It may be understood as follows: given a connection
between $\alpha$ and $\beta$ and a connection between $i$ and $j$, the global energy term may be
minimized by strengthening $M_{\alpha i}$ and $M_{\beta j}$. The $M$ neurons represent a possible
correspondence between pairs $\alpha\beta$ and pairs $ij$. Without any limiting constraints,
such a term merely maximizes the total number of consistency rectangles.
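The rectangle-counting objective can be illustrated with a minimal sketch. It assumes the objective has the form $E(M) = -\sum_{\alpha\beta}\sum_{ij} G_{\alpha\beta} g_{ij} M_{\alpha i} M_{\beta j}$; the three-node graphs, the relabelling and the function names are hypothetical and chosen only for illustration. An exhaustive search over permutations stands in for the network dynamics.

```python
from itertools import permutations

def rectangle_energy(G, g, M):
    """E(M) = -sum over (a, b) and (i, j) of G[a][b] * g[i][j] * M[a][i] * M[b][j]."""
    n, m = len(G), len(g)
    total = 0.0
    for a in range(n):
        for b in range(n):
            if not G[a][b]:
                continue  # only G-linked model pairs can form rectangles
            for i in range(m):
                for j in range(m):
                    total += g[i][j] * M[a][i] * M[b][j]
    return -total

def permutation_matrix(perm):
    """Match matrix with M[a][perm[a]] = 1 and zeros elsewhere."""
    n = len(perm)
    return [[1.0 if perm[a] == i else 0.0 for i in range(n)] for a in range(n)]

# Hypothetical model graph: the path 0-1-2.
G = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# Hypothetical data graph: the same path with nodes relabelled.
relabel = (2, 0, 1)  # model node a corresponds to data node relabel[a]
g = [[0] * 3 for _ in range(3)]
for a in range(3):
    for b in range(3):
        g[relabel[a]][relabel[b]] = G[a][b]

# Exhaustive search over all one-to-one matchings (feasible only for tiny graphs).
best = min(permutations(range(3)),
           key=lambda p: rectangle_energy(G, g, permutation_matrix(p)))
```

Any matching that maps the middle model node onto the middle data node completes all four directed rectangles, so it attains the lowest energy.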

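Other terms are necessary to reflect the constraint of one-to-one matches between
model nodes and data nodes, limiting the proliferation of consistency rectangles:

$$\sum_{i} M_{\alpha i} - 1 = 0 \quad \text{for each } \alpha$$

$$\sum_{\alpha} M_{\alpha i} - 1 = 0 \quad \text{for each } i$$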

These may be understood as follows: the sum of the $M$ neurons connected to any particular
$\alpha$ (or $i$) must equal unity. If the $M$ were to assume binary values, then these rules
require that each $\alpha$ (or $i$) be matched to one and only one $i$ (or $\alpha$). Such constraints
may be expressed as penalty terms in an objective function and summed:

$$E(M) = c_1 \Big[ \sum_{\alpha} \Big( \sum_{i} M_{\alpha i} - 1 \Big)^2 + \sum_{i} \Big( \sum_{\alpha} M_{\alpha i} - 1 \Big)^2 \Big]$$

Another term is used to encourage the match neurons to assume binary values:

$$E(M) = \sum_{\alpha i} M_{\alpha i} (1 - M_{\alpha i}).$$

This term is minimized as each $M_{\alpha i}$ approaches 0 or 1. Another term that is used in
Stickville is the analog gain term [2]:

$$E(M) = \sum_{\alpha i} \int^{M_{\alpha i}} g^{-1}(x)\, dx$$

$$g(x) = \frac{1}{2} (1 + \tanh(x))$$

where  g  is  a  sigmoidal  gain  function.  Such  a  term  acts  as  a  barrier  to  prevent  the 
allowed  range  of  M  from  being  exceeded  by  increasing  the  energy  needed  to  approach 
the  edges  of  the  range.  In  a  discrete  simulation  of  an  analog  neural  network,  this  has 
the  effect  of  reducing  the  step  size  as  the  edge  of  the  boundary  is  approached. 
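The step-size effect can be seen in a short numerical sketch. It assumes that the discrete simulation updates an internal state $u$ by a fixed increment and obtains $M = g(u)$ with $g(x) = \frac{1}{2}(1 + \tanh(x))$; the increment `du` and the function names are illustrative assumptions, not values from the simulations.

```python
import math

def g(u):
    """Sigmoidal gain function g(u) = (1 + tanh(u)) / 2, mapping u to (0, 1)."""
    return 0.5 * (1.0 + math.tanh(u))

def g_inverse(m):
    """Inverse gain: the internal state u that produces activation m."""
    return math.atanh(2.0 * m - 1.0)

def step_in_M(m, du):
    """Change in M caused by a fixed step du in the internal state u."""
    return g(g_inverse(m) + du) - m

du = 0.1                            # hypothetical fixed step in u
mid_step = step_in_M(0.5, du)       # step taken in the middle of the range
edge_step = step_in_M(0.99, du)     # step taken close to the upper boundary
```

Since $g'(u) = 2M(1 - M)$ for this gain function, the same step in $u$ moves $M$ far less near 0 or 1 than near 0.5, and $M$ never leaves the open interval (0, 1).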

In Figure 1, the match matrix is depicted as a 2-D array of match neurons (circles)
whose values are proportional to the radius of the shaded region of each circle. The solution
depicted,  a  1-0  permutation  matrix,  satisfies  the  various  constraints  and  is  consistent 
with  an  optimal  matching  of  the  two  graphs  shown  in  the  figure. 
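As a check on the syntactic terms, the sketch below evaluates a row/column penalty and the binary penalty on a permutation matrix and on a uniform matrix. It assumes squared row and column penalties with unit coefficient; the matrices and function names are hypothetical.

```python
def row_column_penalty(M):
    """Sum of squared deviations of every row sum and column sum from unity."""
    rows = sum((sum(row) - 1.0) ** 2 for row in M)
    cols = sum((sum(col) - 1.0) ** 2 for col in zip(*M))
    return rows + cols

def binary_penalty(M):
    """Sum of M * (1 - M); zero only when every entry is 0 or 1."""
    return sum(m * (1.0 - m) for row in M for m in row)

# A 1-0 permutation matrix satisfies both kinds of constraint.
P = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0]]

# A uniform matrix satisfies the row/column sums but is maximally undecided.
U = [[1.0 / 3.0] * 3 for _ in range(3)]
```

The uniform matrix shows why the binary term is a separate requirement: its rows and columns sum to one, yet no match has been decided.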

As shown in [2], such objective functions specify the connection weights and biases
for a network of neurons $M_{\alpha i}$. Suitable dynamics, discussed in later sections, may be
specified so that the network evolves to find a local minimum of the objective function.
The dynamics approximate a gradient descent on the energy surface created by the
constraint  functions.  The  gradient  descent  is  distorted  by  the  barrier  terms  near  the 
boundaries.  A  discussion  of  the  effects  of  the  barrier  term  on  the  network  dynamics 
may  be  found  in  [9]. 

3.3  Varieties  of  Neural  Networks 

The  details  of  implementation  of  neural  networks  differ  with  respect  to  the  behavior 
of  the  individual  neuron  (binary,  analog,  linear,  nonlinear),  the  nature  of  connections 
between  neurons  (complete  interconnections,  layered  approach,  directed),  and  the 
selection  of  learning  algorithms  used.  We  have  chosen  to  use  Hopfield-style  analog 
networks  with  the  neurons  using  sigmoidal  output  functions. 

4  A  Simple  Domain:  Stickville 

We  attempt  recognition  in  Stickville,  a  simple  domain  of  connected  assemblages  of 
directed  linear  “sticks”,  each  possessing  a  base  and  tip  [4],  as  shown  in  Figure  2. 
Ambiguity  exists  only  for  the  initial  root  data  stick,  the  first  point  entered  being 
defined  as  the  base  (circle  at  end  of  stick  in  Fig  2).  All  other  sticks  have  their  base 
located  on  their  parent  stick.  In  Stickville,  we  abstract  objects  by  main  parts,  i.e. 
designated  sticks  defined  in  the  model  base.  Thus  a  plane  is  abstracted  by  a  single 
main  part  fuselage.  The  parameters  of  fuselage  become  the  parameters  of  plane. 

Data is represented by a sparse connection matrix $ina_{ij}$ whose value is unity if stick
$i$ is connected to stick $j$. The matrix $ina$ is symmetric, with $ina_{ij} = ina_{ji}$.
This is done to allow any part to assume the role of a local root stick, thereby
becoming the main part. Models may then match to a connected subset of a connected
assemblage of sticks in the data base. In such a case, a jet may be recognized even
though it is attached to a runway, which is not in the model base. A parent data
stick  may  be  connected  to  many  offspring  data  sticks,  but  each  data  stick  must  have 
only  one  parent  data  stick.  This  requirement  disallows  the  sharing  of  a  common  data 
stick  between  two  parent  sticks.  For  example,  a  wing  may  not  be  attached  to  two 
planes.  Thus  each  connected  assemblage  of  sticks  in  the  data  forms  a  tree  structure. 

Appended to each connected stick pair is a set of three parameters $P_{ij} = (r_{ij}, \theta_{ij}, q_{ij})$
that measure, respectively, relative size, angle, and location of the attachment point
(Figure 2(c)). Relative length $r_{ij}$ is defined as the logarithm of the ratio of the lengths
of the two sticks, with the range $(-\infty, +\infty)$.
The relative angle $\theta_{ij}$ is defined as the positive acute angle separating $i$ and $j$, the
allowable range being $[0, \pi/2]$. By this definition, $\theta_{ij} = \theta_{ji}$. There is some ambiguity in
our definition of $\theta$, since stick $j$ may assume 4 orientations with respect to stick $i$,
all having the same $\theta_{ij}$. The relative attachment point $q_{ij}$ is defined as the fractional
distance along $i$, measured from the base of $i$, at which the base of $j$ is attached. The allowable
range is $[0, 1]$. For the case where stick $i$ abuts stick $j$, we set $q_{ij} = 0$. The $P_{ij}$
are presumed precomputed for the input data set.

Models are specified in a similar manner (Figure 2(b)). A sparse binary matrix
$INA_{\alpha\beta}$ is unity if model stick $\beta$ is “part of” model stick $\alpha$. For Stickville, the $INA$
matrix encodes a tree structure (no shared parts). The matrix $INA_{\alpha\beta}$ is not symmetric,
unlike $ina_{ij}$. This is because the root part is presumed known, as is the order of
hierarchy from parent to offspring sticks. Implicit here is the abstraction of a collection
of parts to a single main part. Thus several stick models constituting an “airplane”
may have a single stick, “fuselage”, as a main part. This is shown in Figure 2.

Figure 2: Definitions in Stickville.

Appended to each $INA$-linked model pair is a parametrized function $F_{\alpha\beta}(P_{ij})$ that
specifies how well the parameters of a stick pair $ij$ fit the requirements for models $\alpha$ and
$\beta$. For Stickville, $F$ was a simple quadratic that assumed a positive value of unity for
good matches and negative values for bad matches. For example, for relative size,

$$F_{fuselage,wing}(P_{4,1}) = \frac{1 - \left( (r_{4,1} - \bar{r}_{fuselage,wing}) / \sigma_{fuselage,wing} \right)^2}{\sum_{\beta} INA_{fuselage,\beta}}$$


where $\bar{r}_{fuselage,wing}$ is the nominal best value (entered by hand) and $\sigma_{fuselage,wing}$ is the
allowable variation for the relative size. For the simulations reported, $\sigma$ was a constant
independent of $\alpha$ and $\beta$, with $(\sigma_r, \sigma_\theta, \sigma_q) = (1, \pi/2, 1)$, respectively. The different $\sigma$
scale their respective $F$ values, allowing them to be summed and expressed as a single
value. The $INA$ term in the denominator normalizes $F_{\alpha\beta}$ with respect to the number
of model parts, allowing models with differing numbers of parts to compete equally.
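A minimal sketch of one such fit term follows, assuming the quadratic form described above: one minus the squared, scaled deviation from the nominal value, divided by the number of parts of the parent model. The numerical values and the function name are hypothetical.

```python
def quadratic_fit(value, nominal, sigma, n_parts):
    """Quadratic fit: 1/n_parts for a perfect match, negative beyond one sigma."""
    return (1.0 - ((value - nominal) / sigma) ** 2) / n_parts

# Hypothetical nominal relative size 0.5, allowable variation 1.0,
# and a parent model with two INA-linked parts.
perfect = quadratic_fit(0.5, 0.5, 1.0, 2)    # measured value equals the nominal value
poor = quadratic_fit(2.5, 0.5, 1.0, 2)       # measured value is two sigma away
four_parts = quadratic_fit(0.5, 0.5, 1.0, 4) # same fit, parent with four parts
```

With this normalization, a model whose parts all fit perfectly accumulates a total of one regardless of how many parts it has, which is what lets models of different sizes compete.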

The  functions  F  are  hand  designed,  but  one  might  suspect  that  F  or  possibly  its 
parameters  could  be  learned  from  training  examples  [9].  Learning  the  database  itself 
would be a much more difficult problem [10].

5 From Graph Matching to Model Matching

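The objective function for maximizing consistency rectangles may be modified simply
to incorporate the “quality of fit” as measured by $F(P)$:

$$E(M) = -\sum_{\alpha\beta} \sum_{ij} INA_{\alpha\beta}\, ina_{ij}\, M_{\alpha i}\, M_{\beta j}\, F_{\alpha\beta}(P_{ij})$$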

We refer to this as the “rectangle rule.” Figure 3 represents such a consistency
rectangle in Stickville, with the match neurons depicted as circles. If a stick pair
$ij$ fits well with model pair $\alpha\beta$, thus having a positive $F_{\alpha\beta}(P_{ij})$, the energy term
favors this $\alpha\beta ij$ rectangle, strengthening both $M_{\alpha i}$ and $M_{\beta j}$. For bad fits that have a
negative $F_{\alpha\beta}(P_{ij})$, the proposed rectangle raises the energy and is discouraged, weakening
both $M_{\alpha i}$ and $M_{\beta j}$. As in the case for exact graph matching, the rectangle rule must
be supplemented by additional “syntactical” terms.

Figure 3: Stickville Rectangle.


Since the syntactic objective functions used in Stickville are somewhat baroque, it
is instructive to first consider a series of related simpler problems and express appropriate
objective functions for these. A single-level Stickville problem, i.e. one whose
data term is linear in $M$ rather than quadratic as in the rectangle rule, consists of
the optimal selection of a permutation matrix, directed by the fit constraints which
select the best match between graph parts. This is known as a linear weighted matching
problem, and appears in combinatorial optimization as the Task Assignment Problem.
For this problem, we assume an equal number of parts and models and hence
a square permutation matrix. In Stickville, the permutation requirement enforces
a unique match between each model stick and a data stick. The driving energy term
in such a problem picks the best combination of such selections, such as the one with
minimum total, while still satisfying the permutation constraint. In Stickville, the
minimization seeks the closest fit between parts. An example of a minimized permutation
matrix is shown in Figure 4, where the entries are numerical fits.

Figure 4: Minimized Permutation Matrix.
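For very small instances, the linear weighted matching problem can be solved by exhaustive search, which is useful as a reference when checking a network solution. The sketch below is an illustration only: the cost matrix is hypothetical, and the search is brute force rather than the network method described here.

```python
from itertools import permutations

def best_assignment(cost):
    """Return the permutation p minimizing sum(cost[b][p[b]]) and its total cost."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[b][p[b]] for b in range(n)))
    return best, sum(cost[b][best[b]] for b in range(n))

# Hypothetical costs: cost[b][j] is the misfit of assigning model part b to stick j.
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]

assignment, total = best_assignment(cost)
```

The greedy choice of the single smallest entry (part 1 to stick 1) is not part of the optimal assignment here, which illustrates why the selections must be made jointly.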

A  single  level  Stickville  network  is  shown  in  Figure  5.  In  the  diagram  at  left,  the 
INA links are shown as lines connecting model node $\alpha$ to various child nodes $\beta$; the
ina links are shown as lines connecting parent stick $i$ to offspring sticks $j$. Possible
matches  are  shown  as  lines  connecting  model  and  data  nodes,  with  match  neurons 
(circles)  sitting  on  the  lines. 

To retain Stickville context, we show a possible high-level match $M_{\alpha i}$, but if this
is fixed at unity, then the problem reduces to the linear weighted match problem of
optimally assigning all child parts $\beta$ of parent $\alpha$ to all child sticks $j$ of parent stick
$i$. That is, if $M_{\alpha i}$, $INA_{\alpha\beta}$, and $ina_{ij}$ in the rectangle rule are fixed at unity, and the fits
$F_{\alpha\beta}(P_{ij})$ now depend only on $\beta$ and $j$ (e.g. $F_{\alpha\beta}(P_{ij})$ becomes $F_{\beta j}$), then the rectangle rule
becomes an objective function appropriate for the Task Assignment Problem:

$$E = -\sum_{\beta j} F_{\beta j} M_{\beta j}$$

Figure 5: A single level Stickville matrix.

The dynamical variables are $M_{\beta j}$, depicted as a match matrix. Three candidate
matches are shown. The left figure depicts all nine possible matches as the lower
parts of rectangles. Stickville requires a one-to-one matching between model and
data parts, namely, that each model stick match one data stick, and that each data
stick match one model stick. In addition, each neuron should approach a binary value
at the solution. The row and column constraints needed to enforce these rules may
be expressed as:

$$\sum_{\beta} M_{\beta j} - 1 = 0$$

$$\sum_{j} M_{\beta j} - 1 = 0$$

$$(M_{\beta j} - 1) M_{\beta j} = 0$$

In this case the binary selection rule is not strictly necessary since the problem is one
of linear programming. In such a case, an optimal solution is guaranteed to
lie at one of the boundaries, where the $M_{\beta j}$ are 0 or 1, and a linear programming algorithm
finds it. (Indeed, the use of the binary constraint may be counterproductive since it
encourages the search of the solution space only near the boundaries.)

A more difficult case involves trying to match unequal numbers of model and data
sticks, illustrated in Figure 6, where $M_{\alpha i}$ and the $ina$ and $INA$ matrices are again held
at unity. The linear version of the rectangle rule is still the same, but the syntactical
constraints now differ. The permutation matrix is no longer square, and either some
model or some data sticks will not match. In such cases, a constraint is needed to
allow the columns or rows to sum to either 0 or 1. For the row and column constraints
above, the equations become:

$$\Big( \sum_{\beta} M_{\beta j} - 1 \Big) \sum_{\beta} M_{\beta j} = 0$$

$$\Big( \sum_{j} M_{\beta j} - 1 \Big) \sum_{j} M_{\beta j} = 0$$

Figure  6  shows  the  appropriate  rectangular  match  matrix  and  a  possible  solution. 
The  left  figure  depicts  all  twelve  possible  matches  as  the  lower  parts  of  rectangle 
structures.  Experiments  show  that  when  the  above  constraints  are  expressed  as  third- 
order  penalty  terms,  globally  correct  matches  are  almost  always  obtained  for  small 
(5  x  14)  matrices. 
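The zero-or-one rule for rectangular match matrices can be sketched as a third-order penalty. The form used below, $(s - 1)^2 s$ for each row or column sum $s$, is an assumption modelled on the third-order terms used elsewhere in this report; the matrices and the function name are hypothetical.

```python
def zero_or_one_penalty(M):
    """Sum over all rows and columns of (s - 1)^2 * s, where s is the line sum."""
    def term(s):
        return (s - 1.0) ** 2 * s   # vanishes when s is 0 or 1
    rows = sum(term(sum(row)) for row in M)
    cols = sum(term(sum(col)) for col in zip(*M))
    return rows + cols

# Two model sticks, three data sticks: one data stick is left unmatched.
valid = [[1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0]]

# Both model sticks claim the same data stick.
clash = [[1.0, 0.0, 0.0],
         [1.0, 0.0, 0.0]]
```

The unmatched column contributes nothing to the penalty, while the doubly claimed column is penalized.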

We may now escalate the problem by reintroducing dynamic matches $M_{\alpha i}$ of
parent sticks and models, and hence reinstating the rectangle rule. First consider the
case, depicted by Fig 5 but now with dynamic $M_{\alpha i}$, with one parent and equal numbers
of sibling parts and models. If we demand, reasonably, that a sibling match confidence
equal that of the parent match, then the row/column constraints previously listed for
Fig 5 become:

$$\sum_{\beta} INA_{\alpha\beta} M_{\beta j} - ina_{ij} M_{\alpha i} = 0$$

$$\sum_{j} ina_{ij} M_{\beta j} - INA_{\alpha\beta} M_{\alpha i} = 0$$

The first of these says that several model parts competing for one stick must have
total activation equal to that of the parent match. The second of these constraints states a similar
constraint for several child sticks competing for the same model. The $ina$, $INA$ terms
act as “filters” to ensure that this competition takes place among the correct set of
competing neurons. (One could imagine a much larger set of neurons than that shown
in Fig 5, for example, with the ones depicted being a subnetwork locally connected by $ina$ and
$INA$.) Of course, there may be competing parent matches (many $\alpha$ and
$i$ instead of one). The proper generalization of the constraints above becomes:

$$\sum_{\beta} INA_{\alpha\beta} M_{\beta j} - \sum_{i} ina_{ij} M_{\alpha i} = 0$$

$$\sum_{j} ina_{ij} M_{\beta j} - \sum_{\alpha} INA_{\alpha\beta} M_{\alpha i} = 0$$

Full Stickville incorporates both object-part relations and matches between unequal
numbers of model and data sticks. The row and column rules above are thus
modified to ones that are actually used in the following Stickville simulations:

$$\Big( \sum_{\beta} INA_{\alpha\beta} M_{\beta j} - \sum_{i} ina_{ij} M_{\alpha i} \Big) \Big( \sum_{\beta} INA_{\alpha\beta} M_{\beta j} \Big) = 0$$

$$\Big( \sum_{j} ina_{ij} M_{\beta j} - \sum_{\alpha} INA_{\alpha\beta} M_{\alpha i} \Big) \Big( \sum_{j} ina_{ij} M_{\beta j} \Big) = 0$$

The  binary  rule  is  necessary  to  help  suppress  local  minima  within  the  solution  space, 
since  the  problem  is  no  longer  one  of  linear  programming.  Strictly,  the  binary  rule 
should  be  held  as  a  hard  constraint,  though  in  the  simulations  presented  here  adequate 
performance  was  obtained  using  it  as  a  penalty  term. 

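The preceding constraints were of second order with respect to $M$. To be expressed
as additive penalty terms, they must be expressed as third-order expressions.
This is shown for the row constraint actually used in Stickville:

$$E = \Big( \sum_{\beta} INA_{\alpha\beta} M_{\beta j} - \sum_{i} ina_{ij} M_{\alpha i} \Big)^2 \Big( \sum_{\beta} INA_{\alpha\beta} M_{\beta j} \Big)$$

The column constraint is expressed similarly as:

$$E = \Big( \sum_{j} ina_{ij} M_{\beta j} - \sum_{\alpha} INA_{\alpha\beta} M_{\alpha i} \Big)^2 \Big( \sum_{j} ina_{ij} M_{\beta j} \Big)$$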

The energy curve of such a third-order neuron is shown in Figure 7. Here we
plot $E$ vs. $M_{\beta j}$ for a single $M_{\beta j}$ and $INA$, $ina$ set to unity. The horizontal axis is in
fractional units of $M_{\alpha i}$ and the vertical axis is in arbitrary energy units. The neuron
has the tendency to seek the minima at either 0 or $M_{\alpha i}$. Since $M_{\alpha i}$ is itself changing
dynamically, the minimum at $M_{\alpha i}$ shifts dynamically.
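The shape of this curve can be reproduced with a minimal sketch. It assumes a single child neuron $x$ with $INA$ and $ina$ set to unity, so that the third-order term reduces to $E(x) = (x - M_{\alpha i})^2 x$; the parent activation of 0.8 and the names are hypothetical.

```python
def third_order_energy(x, parent):
    """E(x) = (x - parent)^2 * x for one child neuron x and parent activation."""
    return (x - parent) ** 2 * x

parent = 0.8                                    # hypothetical parent activation
grid = [k / 100.0 for k in range(0, 101)]       # sample points in [0, 1]
energies = [third_order_energy(x, parent) for x in grid]

# Setting dE/dx = (x - parent) * (3x - parent) to zero gives x = parent
# (a minimum) and x = parent / 3 (the top of the barrier between the minima).
barrier = third_order_energy(parent / 3.0, parent)
```

The energy is zero at both 0 and the parent activation, with a barrier of height $4 M_{\alpha i}^3 / 27$ between them, so the barrier itself grows or shrinks as the parent match changes.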

Figure 7: An illustration of a third-order energy curve

Without a driving force, such a Stickville network will be satisfied with all $M$ at
the ground state, $M = 0$. A constraint is needed that requires the activation of matches
consistent with the minimum number of sticks and models. Such a rule may be
expressed as:

$$\sum_{\beta j} INA_{\alpha\beta}\, ina_{ij}\, M_{\beta j} - k M_{\alpha i} = 0$$

where  k  is  the  minimum  number  of  unique  matches  between  model  and  data  sticks. 
Note  that  k  may  be  precomputed  and  need  not  be  determined  dynamically.  As  a 
penalty  term,  we  may  simply  square  the  above  expression. 

In the simulations presented here the recognition of Stickville figures is not penalized
for extra parts. They simply do not contribute to any of the energy or constraint
terms. There is a penalty, however, for missing parts, since the $F_{\alpha\beta}(P_{ij})$ objective
function is normalized by $INA$. It is possible to modify the constraint terms to penalize
model matches that have additional parts, if it is assumed that an extra part
should detract from the quality of match.

6 Introduction of a Specialization Hierarchy

Indexing into a large database of models may be made efficient by the introduction
of a specialization (ISA) hierarchy. Efficiency is achieved by inheritance of objective
functions; work done in matching a parent model need not be repeated in matching
a specialization of that model, unless a restriction of parent model parameters
is included as part of that specialization. Figure 8 shows a specialization
hierarchy for a set of models in Stickville. Specialization is achieved by narrowing
the allowed ranges of parameters between parts, and by the addition of new parts.
Thus, in Figure 8, a plane model becomes a jet model by narrowing the wing-fuselage
angle so that the wing is swept back. In addition, the jet model contains
engines and elevators not necessary for the plane model. For the Stickville simulations
presented here, we tested only the case where the addition of new parts was required
for specialization. The energy functions here are general, however, and handle both
cases.

We define a fixed sparse binary matrix $ISA_{\alpha\beta}$ whose value is unity if model $\beta$ is a
specialization of model $\alpha$. Our $ISA$ matrix forms a tree structure. Models are defined
as the entire subset of model sticks that uniquely comprise a figure, such as a jet.
Thus, in Figure 8, $ISA_{plane,jet} = 1$ and clearly $ISA_{plane,horse} = 0$. The $ISA$ matrix is not
symmetric.

6.1  An  Objective  Function  for  Specialization 

The matching matrix and the specialization ($ISA$) matrix may be combined simply: if
an assemblage $i$ of sticks matches a model $\alpha$ through the matching matrix, then the
matches between the assemblage and each of the model’s specializations are enhanced.
Furthermore, the specializations $\beta$ compete, on the basis of detailed structure not
present in the general model $\alpha$, to become the unique specialized match to $i$. The
original match between $\alpha$ and $i$ remains. Thus, the intent is to have the specialization
hierarchy serve also as a discrimination tree.

Figure 8: Typical Specialization hierarchy for Stickville

Figure 9 illustrates: model $\alpha$ matches stick $i$. The matches of sibling models $\beta$
and $\beta'$ to the same stick are enhanced, but only to the level of the parent match $M_{\alpha i}$.
They also compete with one another; the sibling that finds greater support from its
parts eventually wins. In steady state, the match $M_{\alpha i}$ and only one of the other two
matches to $i$ remain. The structure can also illustrate “bottom-up” indexing. If one
of the sibling matches, say $M_{\beta i}$, is on, then the parent match $M_{\alpha i}$ becomes active also,
even if it had been initially set at a low value. This bottom-up aspect does not correspond
to the usual notion of a discrimination tree.

A constraint that achieves all of these effects, while retaining the possibility of no
match, is the term

$$\Big( \sum_{\beta} ISA_{\alpha\beta} M_{\beta i} - M_{\alpha i} \Big) \Big( \sum_{\beta} ISA_{\alpha\beta} M_{\beta i} \Big) = 0$$

It is not necessary to normalize this function since collections of model sticks compete
as models, not individual sticks. If the latter were the case, it would be necessary to
normalize the competition rule to allow for equal competition between models with
different numbers of parts.
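The behaviour of the sibling-competition term can be checked on a few hand-picked activation patterns. The sketch assumes the term has the form $(S - M_{\alpha i})\, S$, where $S$ is the summed activation of the ISA children matched to the same stick; the activation values and the function name are hypothetical.

```python
def isa_constraint(parent, children):
    """(S - parent) * S, where S is the total activation of the ISA children."""
    s = sum(children)
    return (s - parent) * s

no_child = isa_constraint(0.9, [0.0, 0.0])       # parent on, no specialization chosen
one_child = isa_constraint(0.9, [0.9, 0.0])      # one sibling at the parent's level
two_children = isa_constraint(0.9, [0.9, 0.9])   # both siblings on at once
```

The term vanishes when no specialization is active or when exactly one sibling has risen to the level of the parent, and is nonzero when two siblings are on together.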


Figure 9: ISA specialization structure


The interaction between $INA$ and $ISA$ is captured in an “ISA-INA” diamond. An
example of the ISA-INA diamond structure is shown in Figure 10. A consistent
rectangle structure is formed between plane, wing and the data sticks (only two data
sticks depicted here), as well as between prop, prop wing and the same data sticks. If
the plane and wing sticks match, then the specializations prop (for “propeller plane”)
and prop wing are encouraged by the $ISA$ energy term. Verification is ensured by the
prop, prop wing, stick, stick rectangle.


6.2  Competition  between  Groups  of  Sticks 

An alternate form of ISA-competition was experimented with. Instead of matching
individual sticks to model nodes and organizing ISA-INA as in the “diamond”
structure, we may match collections of ina-connected sticks directly to collections of
INA-connected models as shown in the example in Figure 11.


Figure 10: ISA-INA structure

Collections of models in this alternate scheme are indicated in caps,
e.g. PLANE [...] r-models [...] sparse binary matrices [...]. An
r-stick $i'$ is composed of sticks 16, 33, 44 in the example shown. The index of
r-sticks ranges over all connected assemblages in the data. Matches are represented
by neurons $R_{\alpha' i'}$ (instead of $M_{\alpha i}$), where the indices range over r-models and
r-sticks. The diamond organization is gone, but the triangle competition remains:

$$\Big( \sum_{\beta'} RISA_{\alpha'\beta'} R_{\beta' i'} - R_{\alpha' i'} \Big) \Big( \sum_{\beta'} RISA_{\alpha'\beta'} R_{\beta' i'} \Big) = 0$$

Our motivation for RISA vs ISA [...] is described in detail
in Section 8.

[...] in terms of the old ones unambiguously. Let $i$
range over all sticks that comprise r-stick $i'$, and let $\alpha$
and $\beta$ range over all models that comprise r-models $\alpha'$ and
$\beta'$. Then,

$$R_{\alpha' i'} = \sum_{\alpha \in \alpha'} \sum_{i \in i'} M_{\alpha i} \qquad (1)$$

Also, $RISA$ may be determined from $ISA$. Let $\alpha$ range over models comprising r-model
$\alpha'$ and $\beta$ range over models comprising r-model $\beta'$. Then if

$$ISA_{\alpha\beta} = 1, \quad RISA_{\alpha'\beta'} = 1. \qquad (2)$$

For example, if $ISA_{wing,prop\ wing} = 1$ as in the diamond figure, and wing, prop wing are
elements of r-models PLANE and PROP, respectively, we conclude that $RISA_{PLANE,PROP} = 1$.
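The grouping of stick-level quantities into r-level quantities can be sketched with dictionaries and sets. Two assumptions are made: that Eq. 1 is a plain sum of the member match neurons, and that an RISA link exists whenever any pair of member models is ISA-linked (Eq. 2). The stick numbers follow the example in the text; the activations, the r-stick label `S` and the function names are hypothetical.

```python
def group_matches(M, r_models, r_sticks):
    """R[(r_model, r_stick)]: sum of M[(model, stick)] over all member pairs."""
    R = {}
    for r_model, models in r_models.items():
        for r_stick, sticks in r_sticks.items():
            R[(r_model, r_stick)] = sum(M.get((a, i), 0.0)
                                        for a in models for i in sticks)
    return R

def group_isa(ISA, r_models):
    """RISA links: (A, B) whenever some member of A has an ISA link to a member of B."""
    RISA = set()
    for A, models_a in r_models.items():
        for B, models_b in r_models.items():
            if A != B and any((a, b) in ISA for a in models_a for b in models_b):
                RISA.add((A, B))
    return RISA

r_models = {"PLANE": ["plane", "wing"], "PROP": ["prop", "prop wing"]}
r_sticks = {"S": [16, 33, 44]}
ISA = {("plane", "prop"), ("wing", "prop wing")}
M = {("plane", 16): 1.0, ("wing", 33): 0.5}   # hypothetical match activations

R = group_matches(M, r_models, r_sticks)
RISA = group_isa(ISA, r_models)
```

Because both quantities are derived from $M$ and $ISA$, no new dynamical variables are introduced by the grouping.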

7  Unconstrained  Optimization  with  Hopfield  Nets 


Constraints can be conveniently reformulated into the framework of unconstrained
optimization using additive “penalty” terms. This is the approach used by Hopfield
in the TSP [2]. For example, the constraint of sibling competition discussed above
becomes the additive term,

$$c \Big( \sum_{\beta} ISA_{\alpha\beta} M_{\beta i} - M_{\alpha i} \Big)^2 \Big( \sum_{\beta} ISA_{\alpha\beta} M_{\beta i} \Big)$$

where $c$ is a coefficient that determines the strength of this penalty term. The minima
of this term coincide with the desired constraint. A major difficulty with such penalty
terms is in the selection of $c$. A larger $c$ enforces the constraint more rigidly, but at
the expense of other penalty terms.

Dynamic  variables  are  identified  with  neurons,  and  the  connection  weights  are 
specified  by  the  objective  function.  Hopfield  [1]  has  shown  that  descent  to  a  local 
minimum  of  the  objective  function  follows  from  the  equations  of  motion, 


$$\frac{du_{\alpha i}}{dt} = -\frac{\partial E}{\partial M_{\alpha i}}$$

$$M_{\alpha i} = g(u_{\alpha i})$$

where $g$ is the sigmoidal mapping between $M$ and an internal state $u$.

8 Constrained Optimization with Hopfield Nets

Unconstrained optimization problems offer the advantages of simplicity and the ability
of direct implementation in Hopfield networks. In certain situations, they possess the
advantage of seeking compromises between constraints. However, they possess the
undesirable properties of not exactly satisfying the original constraint, of requiring
suitable weighting factors to be found in relation to the other constraints, and of undesirable
convergence rates as the constraint strength is increased.

It is possible to solve the constrained problem exactly using the method of Lagrange
multipliers [5]. Hard constraints may be expressed in the form $h = 0$. A
network implementation is possible [6] where the Lagrange multipliers $\lambda$ are themselves
neurons. The equations of motion are,


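$$\frac{dM_{\gamma k}}{dt} = -\frac{\partial E}{\partial M_{\gamma k}} - \sum_{\alpha i} \lambda_{\alpha i} \frac{\partial h_{\alpha i}}{\partial M_{\gamma k}} \qquad (3)$$

$$\frac{d\lambda_{\alpha i}}{dt} = h_{\alpha i}(M) \qquad (4)$$

where now we use double subscripting $\alpha i$ to index hard constraints $h_{\alpha i}$ and associated
Lagrange neurons $\lambda_{\alpha i}$. The term $E$ includes the objective and remaining penalty
terms.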

A gradient ascent is performed on the Lagrange neurons, while a steepest gradient
descent is performed on $M$ [6]. The method may be modified slightly to ensure
positive definiteness and convergence around the minima [6]. The method allows the
constraint $h(M)$ to be held exactly at the solution point and also relaxes the selection
of constraint strengths.
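The descent/ascent structure of these equations can be tried on a toy problem unrelated to Stickville: minimize $x^2 + y^2$ subject to $h = x + y - 1 = 0$. This is a minimal sketch using forward Euler integration; the problem, the step size and the function name are assumptions made for illustration, and no sigmoidal gain is included.

```python
def lagrange_dynamics(steps=5000, dt=0.01):
    """Euler integration of descent on (x, y) and ascent on the multiplier lam."""
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(steps):
        h = x + y - 1.0            # hard constraint, should reach zero
        dx = -2.0 * x - lam        # -dE/dx - lam * dh/dx
        dy = -2.0 * y - lam        # -dE/dy - lam * dh/dy
        dlam = h                   # gradient ascent on the Lagrange variable
        x, y, lam = x + dt * dx, y + dt * dy, lam + dt * dlam
    return x, y, lam

x, y, lam = lagrange_dynamics()
```

The trajectory spirals in to $x = y = 0.5$ with $\lambda = -1$, where the constraint holds exactly. No penalty coefficient had to be chosen, which is the advantage claimed for the method.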

As  stated  above,  the  constrained  optimization  equations  have  no  provision  for 
the  analog  gain  “barrier”  term.  A  slight  modification  leads  to  a  constrained  form  of 
Hopfield  dynamics: 

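$$\frac{du_{\gamma k}}{dt} = -\frac{\partial E}{\partial M_{\gamma k}} - \sum_{\alpha i} \lambda_{\alpha i} \frac{\partial h_{\alpha i}}{\partial M_{\gamma k}}$$

$$M_{\gamma k} = g(u_{\gamma k})$$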

One practical consideration is that each of our penalty constraints may be written
as a sum of (at most cubic) monomial terms. The sums determine connection
weights between neurons, but no new neurons need be introduced even if there are
many terms in the sum. On the other hand, reformulation with Lagrange multipliers
requires one Lagrange neuron for each term in the constraint sum, leading to a
proliferation of Lagrange neurons. For objective functions with a limited number of
constraints, such as the ISA fanout rule, the overhead of additional Lagrange neurons
and connections is manageable. For objective functions with multiple constraints,
such as the binary-value objective function, a Lagrange neuron is needed for each
match neuron. Such a requirement doubles both the number of connections and the
number of neurons. Depending on the expense of implementing connections and neurons,
which itself depends on the neural network implementation method, such a cost
may be prohibitive. On the other hand, it could be argued that the network already
has $O(N^2)$ neurons for a database of $O(N)$ parts or models, so a doubling due to
Lagrange neurons is not relatively expensive.

Stickville  was  implemented  using  Lagrange  neurons  to  enforce  the  ISA  fanout  rule. 
This  use  of  Lagrange  neurons  led  to  a  modest  increase  in  the  number  of  neurons  and 
connections  in  our  simulations.  In  this  case  there  is  one  hard  constraint  for  every 
occurrence  of  ISA  sibling  competition  as  in  Figure  9.  The  hard  constraint  demands 
that at the fixed point, at most one ISA sibling shall be on (equal to the activation
of the parent), and all of the others off (equal to 0). We may index the constraint by the
indices of the parent neuron $\alpha i$ and thus write the constraint as:

$$h_{\alpha i}(M_{\gamma k}) = h_{\alpha i}(M_{\alpha i}, \text{ISA siblings of } M_{\alpha i}) = \Big( \sum_{\beta} ISA_{\alpha\beta} M_{\beta i} - M_{\alpha i} \Big) \Big( \sum_{\beta} ISA_{\alpha\beta} M_{\beta i} \Big) = 0 \qquad (5)$$

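The equations of motion then become:

$$\frac{du_{\gamma k}}{dt} = -\frac{\partial E}{\partial M_{\gamma k}} - \lambda_{\alpha i}\frac{\partial h_{\alpha i}}{\partial M_{\gamma k}}$$

$$\frac{d\lambda_{\alpha i}}{dt} = \Big(\sum_{\beta} ISA_{\alpha\beta} M_{\beta i} - M_{\alpha i}\Big)\Big(\sum_{\beta} ISA_{\alpha\beta} M_{\beta i}\Big)$$

$$M_{\gamma k} = g(u_{\gamma k})$$

where,

$$\frac{\partial h_{\alpha i}}{\partial M_{\gamma k}} =
\begin{cases}
\Big(2\sum_{\beta} ISA_{\alpha\beta} M_{\beta i} - M_{\alpha i}\Big)\, ISA_{\alpha\gamma} & \text{if } M_{\gamma k} \text{ is an ISA child of } M_{\alpha i} \\
-\sum_{\beta} ISA_{\alpha\beta} M_{\beta i} & \text{if } M_{\gamma k} = M_{\alpha i} \\
0 & \text{if } M_{\gamma k} \text{ is neither child nor parent of } \alpha i
\end{cases}$$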
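
As a hedged illustration of the constraint of Eq. 5 for a single parent neuron, the following Python sketch evaluates the constraint and compares its analytic partial derivatives with a central finite difference. The function names and the flat-list representation of the sibling activations are assumptions made for this example; they are not taken from the Stickville implementation.

```python
def isa_constraint(parent, children):
    """h = (sum(children) - parent) * sum(children).

    The product vanishes when the children sum to the parent's
    activation, or when all children are off.
    """
    s = sum(children)
    return (s - parent) * s


def isa_constraint_grad(parent, children):
    """Analytic partial derivatives of h.

    Returns (dh/dparent, [dh/dchild for each child]); every child
    shares the same derivative because h depends only on their sum.
    """
    s = sum(children)
    return -s, [2.0 * s - parent for _ in children]


def numeric_grad(parent, children, eps=1e-6):
    """Central-difference estimate of the same derivatives."""
    d_parent = (isa_constraint(parent + eps, children)
                - isa_constraint(parent - eps, children)) / (2 * eps)
    d_children = []
    for n in range(len(children)):
        up = list(children)
        up[n] += eps
        down = list(children)
        down[n] -= eps
        d_children.append((isa_constraint(parent, up)
                           - isa_constraint(parent, down)) / (2 * eps))
    return d_parent, d_children
```

With a parent at 1, the configurations "exactly one sibling on" and "all siblings off" both give h = 0, while two siblings on at once gives a nonzero value for the Lagrange neuron to act on.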

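A similar constraint may be formulated for the RISA links. By analogy with Eq. 5,
it is:

$$h_{\alpha' i'}(R_{\gamma' k'}) = h_{\alpha' i'}(R_{\alpha' i'}, \text{RISA siblings of } R_{\alpha' i'}) = \Big(\sum_{\beta'} RISA_{\alpha'\beta'} R_{\beta' i'} - R_{\alpha' i'}\Big)\Big(\sum_{\beta'} RISA_{\alpha'\beta'} R_{\beta' i'}\Big) = 0$$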


By Eqs 1 and 2, the above constraint may be re-expressed in terms of the dynamical
variables $M$. The equations of motion follow analogously from Eqs 3 and 4.

Let primed quantities $\alpha', \beta'$ index r-models and RISA links, and let $i'$ index
r-sticks. Also let the corresponding unprimed quantities $(\alpha, \beta, i)$ index the
models and sticks that comprise the corresponding r-models and r-sticks. Then we may write $h_{\alpha' i'}$
in terms of match neurons:

$$h_{\alpha' i'}(M) = \Big(\sum_{\beta'} RISA_{\alpha'\beta'} \sum_{\beta \in \beta'} \sum_{i \in i'} M_{\beta i} - \sum_{\alpha \in \alpha'} \sum_{i \in i'} M_{\alpha i}\Big)\Big(\sum_{\beta'} RISA_{\alpha'\beta'} \sum_{\beta \in \beta'} \sum_{i \in i'} M_{\beta i}\Big) = 0$$

From the form of the constraint, it is seen that entire groups of neurons compete
rather than individual ISA siblings.

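The equations of motion become:

$$\frac{du_{\gamma k}}{dt} = -\frac{\partial E}{\partial M_{\gamma k}} - \lambda_{\alpha' i'}\frac{\partial h_{\alpha' i'}}{\partial M_{\gamma k}}, \qquad \frac{d\lambda_{\alpha' i'}}{dt} = h_{\alpha' i'}(M)$$

where,

$$\frac{\partial h_{\alpha' i'}}{\partial M_{\gamma k}} =
\begin{cases}
\Big(2\sum_{\beta'} RISA_{\alpha'\beta'} \sum_{\beta \in \beta'} \sum_{i \in i'} M_{\beta i} - \sum_{\alpha \in \alpha'} \sum_{i \in i'} M_{\alpha i}\Big)\, RISA_{\alpha'\gamma'} & \text{if } M_{\gamma k} \neq M_{\alpha i},\ \gamma \in \gamma',\ k \in i' \\
-\sum_{\beta'} RISA_{\alpha'\beta'} \sum_{\beta \in \beta'} \sum_{i \in i'} M_{\beta i} & \text{if } M_{\gamma k} = M_{\alpha i},\ \alpha \in \alpha',\ i \in i' \\
0 & \text{otherwise}
\end{cases}$$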

The above dynamics are complicated, but may be understood more easily with
the aid of Figs. 9 and 11. In Fig. 9, individual neurons that match the same stick
and are also ISA siblings compete via the constraint $h_{\alpha i}$ of Eq. 5. Strictly speaking,
the constraint would be satisfied if the sum of the neurons equaled unity, but other
constraints force one (or none) of the siblings to the value of the parent neuron and the others
to zero. In Fig. 11, the $R_{\alpha' i'}$ neurons represent the sum of a large cross product, i.e.
$R_{\alpha' i'} = \sum_{\alpha \in \alpha'} \sum_{i \in i'} M_{\alpha i}$. The sibling cross products $R_{\beta' i'}$ compete as sums. Now, other
constraints force (i) all neurons $M$ in all cross products to zero except for one cross
product and (ii) the correct subpermutation within the winning cross product. As
seen in the next section, the latter method with r-models often led to more successful
results.
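
The group competition described above can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary layout of the match matrix, the part names, and the function names are hypothetical, and the RISA link matrix is represented simply by listing the child groups of one parent r-model.

```python
def group_activation(M, models, sticks):
    """R = sum of M[a][i] over every model a in the r-model and every
    stick i in the r-stick (the 'cross product' sum)."""
    return sum(M[a][i] for a in models for i in sticks)


def risa_constraint(M, parent_models, child_groups, sticks):
    """(sum of child-group sums - parent sum) * (sum of child-group sums)."""
    parent = group_activation(M, parent_models, sticks)
    children = sum(group_activation(M, g, sticks) for g in child_groups)
    return (children - parent) * children


# Hypothetical two-stick example: PLANE specializes to JET or PROP.
STICKS = [0, 1]
PLANE = ["plane", "wing"]
JET = ["jet", "jet_wing"]
PROP = ["prop", "prop_wing"]

M_jet_only = {
    "plane": [1.0, 0.0], "wing": [0.0, 1.0],
    "jet": [1.0, 0.0], "jet_wing": [0.0, 1.0],
    "prop": [0.0, 0.0], "prop_wing": [0.0, 0.0],
}
# Same state, but PROP is also fully matched (a violation).
M_both = dict(M_jet_only, prop=[1.0, 0.0], prop_wing=[0.0, 1.0])
```

In this toy state the constraint is zero when only JET is matched and positive when JET and PROP are matched together, which is the situation the Lagrange neuron is meant to suppress.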

9  Implementing  Learning 

Up until this point the “fit” function $F_{\alpha\beta}$ has been a prespecified function. It is a
function which encourages close matches, and discourages mismatches. For Stickville,
$F$ took the form,

$$F_{fuselage,wing}(P_{4,1}) = \frac{1 - \big((r_{4,1} - r_{fuselage,wing})/\sigma\big)^2}{\sum_{\beta} INA_{fuselage,\beta}}$$

where $\sigma$ was a constant value. What this equation states is that variation of a part of
the data compared to the model is judged by a predetermined notion of good fit. In
addition, the same rule is applied without variation to all data parts. This limitation
may be removed by employing a family of parameters $\sigma_{ij}$, though again the set of
parameters is predetermined. An additional limitation of this equation is that it is a
unimodal function, with only one maximum. If we desire that $F$ be capable of
matching two or more discrete examples (e.g. an acute angle and an obtuse angle)
then $F$ is required to be a multimodal function.
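
A minimal Python sketch of the fixed fit function may make the unimodality concrete. The function name, argument names, and the angle values below are assumptions chosen for illustration; only the quadratic form itself comes from the equation above.

```python
def quadratic_fit(r_data, r_model, sigma, n_ina_links=1):
    """Fixed fit: 1 at a perfect match, falling off quadratically with
    the scaled difference, divided by the number of INA links."""
    return (1.0 - ((r_data - r_model) / sigma) ** 2) / n_ina_links


# A model angle of 45 degrees scores an acute datum well but cannot
# also score an obtuse datum well: there is only one peak.
acute_score = quadratic_fit(45.0, 45.0, sigma=45.0)
obtuse_score = quadratic_fit(135.0, 45.0, sigma=45.0)
```

Here the acute example receives the maximum score while the obtuse example is strongly penalized, which is the limitation that motivates a learned, multimodal fit.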

A more desirable solution would be to learn the individual variation rules from a
series of training data. The training data would be labelled as a “good” or “bad” fit
and allow the neural network to adjust the specific multimodal $F_{\alpha\beta}$.

To implement the learning algorithm I have used a modified back-propagation
technique [36]. We wish the untrained network to reject all matches by producing a
near-zero output. Trained positive examples should produce a near-one output, signifying
a good match. Examples within the local neighborhood of a positive example
should produce an intermediate response in (0,1) which rapidly tails off towards zero as
the difference is increased. Since the desired output function on a local scale can be
approximated by a Gaussian function, the output function of the individual neurons
composing the learning network was chosen to be a Gaussian function of the form,

$$f_g(x) = w\, e^{-\eta (x - \lambda)^2}$$

This generates a Gaussian function centered on $\lambda$, with a width determined by $\eta$, and
a magnitude determined by $w$, as shown in figure 12.

Figure 12: Gaussian output function

Back-propagation techniques are usually applied to networks with monotonic (e.g.
sigmoidal) output functions only. Moody has shown that similar equations may be
applied  to  networks  with  hidden-layer  Gaussian  neurons  [37].  These  networks  retain 
the  gradient-descent  nature  of  the  back-propagation  networks. 

A neural network may be created with an input neuron $i$, a set of Gaussian
neurons $j$ in the hidden layer, and a sigmoidal output neuron $o$, as shown in figure 13.
The  sigmoidal  output  neuron  is  used  to  limit  the  total  output  of  the  network  and 
maps  its  input  to  output  through  the  equation: 




Figure 13: Gaussian neural network (layers, top to bottom: output layer, hidden layer, input layer)


The input to this network is a parameter between an INA-linked pair of model parts
$\alpha\beta$ and a matched ina-linked pair $ij$. The desired output is the function $F_{\alpha\beta}$. Training
pairs consist of an observed parameter and whether that is allowable (output 1) or
not allowable (output 0). A back-propagation technique is used to reduce the error
signal generated at the output by adjusting the parameter $w$. The parameters $\eta$, $\lambda$
are adjusted with reference to the input data to obtain an optimum clustering of the
hidden layer neurons.

Within  the  learning  subnetwork  each  hidden  layer  neuron  j  generates  a  Gaussian 
output.  The  aggregate  output  for  a  group  of  hidden  layer  neurons  j  takes  the  form: 


$$net_o = \sum_{j} w_{jo}\, e^{-\eta_j (x - \lambda_j)^2}$$

where $w_{jo}$ is the weighting parameter between hidden layer neuron $j$ and the output
neuron $o$. This output then becomes the input $net_o$ for the single sigmoidal output
neuron $o$.
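
The forward pass just described can be sketched as follows. This is a hedged illustration only: the logistic form of the output neuron, its `gain` and `offset` values, and the two example hidden units are assumptions introduced here (the offset is chosen so that a near-zero net input yields a near-zero output, as the text requires of an untrained network).

```python
import math


def gaussian_response(x, eta, lam):
    """Response of one hidden Gaussian neuron, centered on lam."""
    return math.exp(-eta * (x - lam) ** 2)


def net_input(x, units):
    """Weighted sum over hidden neurons; units is a list of
    (w_jo, eta_j, lam_j) triples."""
    return sum(w * gaussian_response(x, eta, lam) for w, eta, lam in units)


def sigmoid_output(net, gain=10.0, offset=0.5):
    """Assumed logistic output neuron limiting the total response."""
    return 1.0 / (1.0 + math.exp(-gain * (net - offset)))


# Two hidden units give a bimodal response: one bump near an acute
# angle (30 degrees) and one near an obtuse angle (150 degrees).
units = [(1.0, 0.01, 30.0), (1.0, 0.01, 150.0)]
good = sigmoid_output(net_input(30.0, units))
poor = sigmoid_output(net_input(90.0, units))
```

With these illustrative parameters the network responds strongly at either learned angle and weakly in between, which a single fixed quadratic fit cannot do.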

The learning networks are designed with several hidden layer neurons in order to
fine tune the output function $F$. Parameters $w_{jo}, \eta_j, \lambda_j$ exist for each of the hidden
layer neurons $j$, where $o$ is the output neuron.

The application of the back propagation equations involves two phases. The first
phase consists of presenting an input to the network and propagating it forward to
compute the observed output $o_o$ and the generated error signal between the desired
output and the observed output. The second phase consists of a backward pass
through the network, propagating the error signal through each layer of the network
while modifying the neuron parameters to reduce the error signal.

The basic back propagation equation to modify $w$ takes the form:

$$\Delta w_{xy} = l\, \delta_y\, o_x$$

This equation states that the weight $w$ between any units $x, y$ should be changed
by an amount proportional to the error signal $\delta_y$ at the receiving neuron and the
output of the sending neuron $o_x$. $l$ is a parameter that adjusts the rate of learning.
The parameter is chosen to allow a reasonable rate of learning while avoiding the
oscillation of the network that can occur with large values of $l$. For weights between
the hidden layer neurons $j$ and the output neuron $o$, the equation takes the form:

$$\Delta w_{jo} = l\, \delta_o\, o_j$$

The determination of the error signal starts at the output layer. For neurons in
the output layer, $\delta$ takes the form:

$$\delta_o = (t_o - o_o)\, f_s'(net_o)$$

where $f_s'(net_o)$ is the derivative of the activation function which maps the total input
$net_o$ of a neuron to its output value, in this case the derivative of a sigmoidal mapping
function. The term $(t_o - o_o)$ is the difference between the desired output and the
generated output for a given input pattern.
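
A single application of the basic (unmodified) output-layer rule can be sketched in Python as below. The plain logistic activation is an assumption made for this example, as are the function names and the numeric values; the sketch shows only the standard delta rule and not the locality-gated variant.

```python
import math


def logistic(net):
    """Assumed sigmoidal mapping for the output neuron."""
    return 1.0 / (1.0 + math.exp(-net))


def output_delta(target, net):
    """delta_o = (t_o - o_o) * f_s'(net_o); for the logistic function
    the derivative is o * (1 - o)."""
    o = logistic(net)
    return (target - o) * o * (1.0 - o)


def update_output_weights(weights, hidden_out, target, rate):
    """Return new weights w_jo + rate * delta_o * o_j for every hidden
    neuron j, given the hidden-layer outputs o_j."""
    net = sum(w * h for w, h in zip(weights, hidden_out))
    delta = output_delta(target, net)
    return [w + rate * delta * h for w, h in zip(weights, hidden_out)]


# One positive training example: the strongly responding hidden neuron
# (output 1.0) receives a larger weight change than the weak one (0.2).
old_w = [0.0, 0.0]
hidden = [1.0, 0.2]
new_w = update_output_weights(old_w, hidden, target=1.0, rate=0.5)
```

Because the change is proportional to the sending neuron's output, the hidden neuron closest to the training input is adjusted most.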

In order to limit the response of the hidden Gaussian neurons to achieve local
responses, the last equation is modified to the form:

$$\delta_o = (t_o - o_o)\, f_s'(net_o)\, f_s(net_o)$$

This equation, graphically displayed in Figure 14, favors local responses while suppressing
non-local responses. In the figure, the closer Gaussian response is modified
in preference to the remote Gaussian response.

The previous equations adjust the magnitude of the Gaussian neurons to match
the test data. The equations to modify the offset and width of the Gaussian involve
a feed-forward approach to adjust $\eta$, $\lambda$ of the hidden neurons. For $\lambda$, the error term is
the difference between the input parameter $x$ and the individual neuron's $\lambda_j$. Changes
in the $\lambda_j$'s are again gated by the response of the individual Gaussian neuron to the
input, $f(x)$, or equivalently, $f(net)$. The equations for $\eta$ are similar in nature. The
equations are:

$$\Delta\lambda_j = l\, \delta_j, \qquad \delta_j = (x - \lambda_j)\, f_g(x)\, f_g(x)$$

$$\Delta\eta_j = l\, \delta_j, \qquad \delta_j = (x - \eta_j)\, f_g(x)\, f_g(x)$$


10  Experimental  Results 

A set of selected experimental results from Stickville appears in Figures 15 through 22.
For the following simulations, analog neural networks were digitally simulated, using
discrete time steps. The forward Euler method was used to simulate the differential
equations.
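
To illustrate what forward Euler simulation of Lagrange-neuron dynamics looks like, here is a small Python sketch on a toy problem. The objective $E = x^2 + y^2$ and the constraint $h = x + y - 1 = 0$ are assumptions chosen for illustration and are not the Stickville objective; the step size and step count are likewise arbitrary. The variables descend on $E + \lambda h$ while the Lagrange variable ascends on the constraint.

```python
def simulate(dt=0.01, steps=5000):
    """Forward Euler integration of
        dx/dt   = -dE/dx - lam * dh/dx
        dy/dt   = -dE/dy - lam * dh/dy
        dlam/dt = h(x, y)
    for E = x^2 + y^2 and h = x + y - 1."""
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(steps):
        h = x + y - 1.0
        # Evaluate all derivatives at the current state before updating.
        dx = -2.0 * x - lam
        dy = -2.0 * y - lam
        dlam = h
        x += dt * dx
        y += dt * dy
        lam += dt * dlam
    return x, y, lam


x_final, y_final, lam_final = simulate()
```

The state spirals in to the constrained minimum x = y = 0.5 with the multiplier settling at -1, so the hard constraint is met exactly at the fixed point without any penalty weight being tuned.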
Figure 14: Gaussian function


10.1  Non-Learning  Results 

In the following examples, the ISA fanout rule is enforced as a hard constraint, the Lagrange
neurons appearing as “Lagrange ISA Constraints.” The additional constraint
functions, such as binary values and one-to-one matching, are handled as additive
penalty terms. While the performance of the network was sensitive to the initial
selection of the penalty weights, the weights did not need to be adjusted to solve
individual problems. The penalty weights used in the following simulations were:

row, column rules = 1
binary value rule = 1
total activation rule = 0.1
rectangle rule = 5
analog gain term = 1

Unless otherwise noted, the network simulations were started with the match neurons $M$ set to
small random values on the range [.01, .03].
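
For readers reproducing this setup, the stated penalty weights and the random initialization can be written down as a short Python sketch. The dictionary keys, the function name, the nested-list layout of the match matrix, and the seed are all hypothetical; only the numeric weights and the range [.01, .03] come from the text.

```python
import random

# Penalty weights as listed above (key names are illustrative).
PENALTY_WEIGHTS = {
    "row_column_rules": 1.0,
    "binary_value_rule": 1.0,
    "total_activation_rule": 0.1,
    "rectangle_rule": 5.0,
    "analog_gain": 1.0,
}


def initial_match_matrix(n_models, n_sticks, low=0.01, high=0.03, seed=0):
    """Match neurons M[model][stick] set to small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(low, high) for _ in range(n_sticks)]
            for _ in range(n_models)]


M0 = initial_match_matrix(n_models=3, n_sticks=5)
```

Starting every match neuron slightly above zero breaks symmetry between candidate matches without favoring any particular one.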

In the simulations depicted in the figures, there is one Lagrange neuron per ISA
structure. The leftmost neuron refers to the ISA structure higher up in the hierarchy.
We note that the Lagrange neurons are mapped sigmoidally for display (but not
computational) purposes.

Figure 15 illustrates a basic incremental match using ISA links. In this example,
the model consists of a root, which specializes to a plane, which then specializes to a
prop and a jet. ISA neurons are needed for each part at each level of specialization.
Specifically, there is a set of 5 Lagrange neurons for the specialization from root
to plane, 1 for each data stick. There are also sets of Lagrange neurons for the
specialization of plane to jet and prop, left wing to jet left wing and prop left wing,
and right wing to jet right wing and prop right wing. In the figure, the $i$ index of
the Lagrange neuron $\lambda_{\alpha i}$ is given by its column location. The $\alpha$ index proceeds down
the hierarchy: first row is $\alpha$ = root, etc. Figure 15(b) shows the network after it
has settled on the solution of prop. At the solution point, all data sticks match to
root, some match to plane in a specified manner, and all match to prop in a specified
manner.

Figure  16  points  out  a  weakness  of  the  ISA  rule.  The  rule  relies  only  on  local 
information  in  its  decision.  For  example,  the  selection  between  jet  left  wing  and 
prop  left  wing,  and  the  selection  between  jet  left  tail  and  prop  left  tail  are  linked 
only  indirectly  by  the  rectangle  rule.  Figure  16(b)  shows  a  decision  reached  that  is 
consistent  with  the  ISA  rule,  but  is  not  a  valid  terminal  node  of  the  model.  This 
inconsistency  is  avoided  if  the  ISA  rule  is  reformulated  as  the  RISA  rule,  where  the 
competition  is  between  the  sets  of  parts  forming  a  model,  and  not  its  individual  sticks. 
All  the  following  experiments  are  performed  using  the  RISA  rule.  The  RISA  neurons 
are  displayed  in  the  same  manner  described  for  Figure  15. 

Figure 17 illustrates a basic incremental match using the RISA rule. The data is
a connected assemblage of sticks (an r-stick) that perfectly matches a jet model in
this case. The individual models label the rows of the match matrix; the appropriate
r-models (and their constituents) are ROOT (root), PLANE (plane, left wing, right
wing), JET (jet, jet left wing, jet right wing, jet left engine, jet right engine), and
PROP (prop, prop left wing, prop right wing, prop left tail, prop right tail). The RISA
links are $RISA_{ROOT,PLANE}$, $RISA_{PLANE,JET}$, and $RISA_{PLANE,PROP}$. The latter two
compete while the former is its own competition. If we label the sole r-stick as, for
example, 1, then the only Lagrange neurons are $\lambda_{ROOT,1}$ and $\lambda_{PLANE,1}$. These are
displayed at the bottom of the figure.

The figure illustrates the system in terms of models, sticks, and $M$ neurons; the
r-versions are implicit but defined as above. At first, all sticks match root, hence ROOT.
The $RISA_{ROOT,PLANE}$ link encourages a match to PLANE. Several components of
r-stick 1, guided by the rectangle rule, find matches to components of r-model PLANE.
Specifically, sticks 0, 1, 2 match plane, left wing and right wing. This in turn activates
RISA competitors PROP and JET. Through the rectangle rule, they both seek appropriate
matches among the sticks of r-stick 1. Eventually, JET finds greater support;
PROP loses out (thus satisfying the RISA constraint) and the network reaches a fixed
point as shown in Fig. 17(c). Among the possible matches to JET, the rectangle
rule and syntactic constraints select the correct permutation matrix, i.e. the correct
matches. In Fig. 17(b), intermediate results are shown. Curiously, r-model PROP is
better matched than JET at that point in time, but loses out in the end.

Figure 18 shows a net starting from a high-energy undesirable starting position.
In this case, the match neurons from PLANE RISA siblings PROP and JET have been
set to high value in matching the same stick. This creates an unfavorable energy
term from the RISA fanout rule. The Lagrange neuron responsible for enforcing this
constraint performs a gradient ascent until the rule is enforced by suppressing the
matches to PROP.

Figure 19 shows that the network is capable of performing non-terminal matches.
In this case, the data presented to the network is simply a PLANE without specializations.
The RISA links do not proceed past PLANE since the fanout rule was
formulated as being of low energy when the fanout was $M_{\alpha i}$ or 0. This creates a
third-order rule but allows the existence of the two local minima needed to account
for the case of a solution as well as no solution.

Figure  20  shows  that  the  network  can  recognize  objects  as  connected  subsets  of 
connected  assemblages  of  sticks.  In  the  case  presented,  the  model  to  be  recognized  is 
a  JET.  The  data  base  clearly  contains  a  JET  attached  to  an  extraneous  stick.  The 
JET  is  recognized,  while  the  match  to  its  parent  stick  is  not. 

The next few figures show how RISA hierarchies may help speed recognition. As a
minimal requirement, one might demand that a model, such as JET, may be found
more quickly if its RISA parent PLANE is already matched. This is indeed the
case as seen in Figure 21. Here, the network is initialized with correct matches to
ROOT and PLANE already set; the JET is found correctly in 20 time steps. If
the network is initialized to a random start, however, correct recognition of JET
takes considerably longer (70 time steps) as expected. Note here that parts of JET
common to PLANE, such as the wings, are rematched as JET is recognized. The
RISA hierarchy is still efficient though since the reverification of wing in a JET
context proceeds more quickly when the (easier) match of wing in PLANE context
is present.

In the next few simulations, a database of ROOT, PLANE, JET is used to test
for speed comparisons. Table 1 summarizes results. Here r, p, j stand for ROOT,
PLANE, JET, and a capitalized version R, P, J means that the net is initialized
with the capitalized model correctly matched. The network starting positions for Rp
and Rj are shown in Figure 22. The timing values then give the number of time steps
needed for the net to stabilize. Thus, RPj = 20 implies that the net is started with
ROOT, PLANE set and it takes 20 time steps to find JET. Also, Rj = 55 means
that only ROOT, JET are in the model base, and 55 steps are needed to find JET
if ROOT is present. We tested the following cases: RPj, Rpj, Rj, Rp.

Case    Time steps to stabilize
RPj     20
Rpj     50
Rj      55
Rp      45

Table 1: Timing Results

Some  observations  are  in  order:  As  expected,  RPj  =  20  is  much  less  than  Rpj  = 
50;  this  is  essentially  the  same  result  shown  in  Figure  21.  In  addition,  Rpj  =  50  is 
less  than  Rj  =  55,  so  the  introduction  of  an  intermediate  model  speeds  recognition, 
albeit  slightly,  in  this  case.  Finally  it  is  interesting  to  compare  the  time  for  a  fully 
parallel  search,  Rpj,  with  a  “serial”  search  in  which  a  net  first  finds  PLANE  only, 
then  uses  the  result  as  an  initial  condition  to  find  JET.  The  appropriate  comparisons 
are  Rpj  =  50  vs.  Rp  +  RPj  =  65,  so  the  parallel  one  wins  here. 
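
The comparisons in this paragraph are simple arithmetic on Table 1; the short Python sketch below restates them. The variable names are illustrative only, and the numbers are exactly those in the table.

```python
# Time steps to stabilize, from Table 1.
timing = {"RPj": 20, "Rpj": 50, "Rj": 55, "Rp": 45}

# Parallel search: find PLANE and JET together, starting from ROOT.
parallel_steps = timing["Rpj"]

# "Serial" search: find PLANE first, then JET from that result.
serial_steps = timing["Rp"] + timing["RPj"]

# Steps saved by having the intermediate model PLANE in the database.
saved_by_intermediate = timing["Rj"] - timing["Rpj"]
```

The serial total of 65 steps exceeds the parallel 50, and the intermediate model saves 5 steps relative to matching ROOT and JET alone.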
Figure 15: ISA Incremental Match.

The net in (b) has settled on a solution to the graph matching problem
in (a). This figure illustrates the use of Lagrange neurons to enforce the
ISA constraints.

Figure 16: ISA Problem.

The net in (b) has reached a local minimum that is not consistent with a
valid selection of one of the terminal models. This solution, however, is
consistent with the ISA rule, which requires a decision of which part to
activate on a local level only. This inconsistency does not occur when the
RISA rule is used.

Figure 17: RISA Incremental Match.

The net in (c) has settled on a solution to the graph matching problem
in (a). An intermediate point in searching for the solution is illustrated
in (b), where appropriate matches are not yet fully activated and several
inappropriate matches are active. The RISA selection is guided by the
unique features of the specialization. This figure also illustrates the use
of Lagrange neurons to enforce the RISA constraints.

Figure 18: RISA hierarchy.

The net in (b), a representative of the graph matching problem illustrated
in (a), is started with the data sticks matching to both the JET and the
PROP, a violation of the RISA hierarchy rules. In (c), the Lagrange
neuron responsible for satisfying the RISA constraint exactly increases,
eventually forcing the matches to PROP parts to disappear.

Figure 19: Non-Terminal Match.

It is not necessary to match a terminal model of the RISA hierarchy. The
net in (a) has matched the data in (b) to PLANE only, satisfying the
RISA hard constraints.

Figure 20: Sub-Match.

It is possible to match a subset of a connected assemblage of data sticks
to a model. Here, the model JET is recognized in a subsection of the
data base. There is no specialization hierarchy here.

Figure 21: Partial Information.

The matching problem in Figure 18 is started with the M match neurons
to ROOT and to the subparts of PLANE set to active. The network
quickly settles on the correct solution of JET within 20 time steps, being
driven by the RISA link from PLANE. When a JET is matched to JET
with ROOT but not PLANE set (not depicted), the network requires 50
time steps to converge to a solution.

Figure 22: Initial Starting Positions.

Both (a) and (b) show the initial starting positions for the models
PLANE and JET. Both are being matched to a data base containing
a JET, and are started with the matches to ROOT set to active (cases Rp
and Rj).
10.2 Learning Results

In this section learning of the fit parameters $F$ was implemented. The fixed equation
previously employed,

$$F_{fuselage,wing}(P_{4,1}) = \frac{1 - \big((r_{4,1} - r_{fuselage,wing})/\sigma_{fuselage,wing}\big)^2}{\sum_{\beta} INA_{fuselage,\beta}}$$

was replaced by a function generated by back-propagation Gaussian neural networks,
previously described. Separate neural networks of the form shown in figure 13 were
created for each pair $(\alpha, \beta)$ of model parts to be matched. Each contained from
1 to 25 hidden-layer neurons, each with a set of parameters $w_{jo}, \eta_j, \lambda_j$. The parameters
contained initial small random starting values, chosen so that without learning the
default output of the network is a near-zero value for any input. This creates an $F$
which discriminates against all matches unless learning has occurred to modify the
default response. Learning networks with more hidden-layer neurons allowed more
complex learning functions to be generated.

Positive and negative training data of each pair of parts were presented to the
training networks, along with the desired output of the learning network. For positive
examples, the desired output is 1 (good match); for negative examples, the desired
output is 0 (bad match). This process allowed the function $F_{\alpha\beta}$ to be generated
specifically for each pair $\alpha\beta$.

In figure 23 (a), a model and two possible training examples are shown. These
examples are used to generate the function $F$ shown as the solid line in figure 23
(b). The dotted line in the same figure shows the previously employed fixed $F$.
The difference (1) noted in (b) allows the matching neural network to discriminate
between the good and poor matches in (a) using the learned $F$, which is more difficult
with the smaller difference (2) with the fixed $F$. The learning function is not limited
to generating an $F$ with a single Gaussian-type peak. The complexity of the generated
function depends to some extent on the number of hidden-layer neurons. An example
of this is shown in figure 23 (c), where an $F$ has been generated to match obtuse and
sharp angles, while not responding to intermediate angles.

Figure 23: Learning Examples.

11 Discussion

In Stickville, we implement a model-based object-recognition system as a neural network.
The network architecture is specified by a third-order objective function, and
dynamics derive from standard optimization techniques. The system is “neural”
in that a relatively complex calculation is carried out as an analog collective computation
of very simple processing elements (neurons) and, at worst, multiplicative
connections. The system could presumably be implemented in hardware directly as
an analog circuit. (Here we instead simulate the system on an ordinary workstation.)

Other hallmarks of neural nets are learning and fault tolerance. Neither of these
features is addressed here, although, as discussed previously, the Stickville net might
support the learning of match metrics $F(P)$. Fault tolerance derives from distributed
representations in which information is shared among many neurons. In our case,
each “unit” of information, namely $M_{\alpha i}$, is associated with a single neuron in a
unary representation. If the neuron or a connection to it fails, then the information is
lost. Simple tricks like duplicating units can lead to fault tolerance, but distributed
representations may have other advantages [11].

Related to our unary representation is the problem of node proliferation. For $N$
sticks and $M$ models and an average fanout of $f$ for INA, ina links, there are $NM$
neurons and $O(NMf^2)$ wires, clearly a huge number when we expect only $O(N)$
neurons to be active at the fixed point. It may be possible to greatly reduce the
size of the net by transforming the objective function into one which has the same
fixpoints but requires less hardware [12].
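
As a rough, hedged illustration of these counts, the following Python sketch computes the neuron count and an order-of-magnitude wire estimate. The function name and the example sizes are assumptions, and the wire figure drops the unknown constant factor hidden in the $O(\cdot)$ notation.

```python
def network_size(n_sticks, n_models, fanout):
    """Return (neurons, wire_estimate) for a unary match network.

    neurons       = N * M           (one match neuron per stick-model pair)
    wire_estimate = N * M * f**2    (order of magnitude only)
    """
    neurons = n_sticks * n_models
    wire_estimate = neurons * fanout ** 2
    return neurons, wire_estimate


# Hypothetical sizes: 10 data sticks, 20 models, average fanout 3.
neurons, wires = network_size(n_sticks=10, n_models=20, fanout=3)
```

For these illustrative sizes the net has 200 neurons and on the order of 1800 wires, even though only about as many neurons as there are sticks are expected to be active at the fixed point.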

From the standpoint of a vision system, the task tackled here greatly extends
simple recognition strategies, such as template matching, employed in various guises
by other neural nets. It falls short in other respects, though. The data organization
(ina links) is presumed known ahead of time, but a vision system would have to
compute it dynamically. (In some real vision applications, this grouping on the
data side can indeed be done as a preprocess.) The match metrics $F_{\alpha\beta}(P_{ij})$ should
also be computed as needed with parameters of the abstraction computed along the
way. In fact, objective functions may be specified to do this [7], and experiments were
carried out in simple domains that incorporated these features [9], but these achieved
limited success. The reasonably successful results for Stickville are encouraging in
this light.

A few interesting sidelights are found in Stickville. Because we need to match unequal
numbers of parts and models, a novel third-order energy function is formulated
to solve this “partial match” problem. In simulations, it works quite well for problems
that depend linearly on the data. The use of Lagrange multiplier neurons allowed the
introduction of hard constraints within a Hopfield-style network. This
allowed constraints with limited scope (the ISA specialization hierarchy) to be held
exactly at the solution, while removing the need to select weighting coefficients. The
binary constraint rule was implemented as a penalty term in the experiments presented
here. It was not implemented as a Lagrange term since each neuron would
require its own Lagrange neuron, effectively doubling the size of the network. Strictly
speaking it should be implemented as a Lagrange term since the penalty method only
reduces the likelihood of selecting a minimum within the solution space. This did not
appear, however, to be a problem with the results presented here.

One might expect that the hierarchical organization of the database would promote
rapid and correct matches. The results with ISA hierarchies are so far equivocal,
however. The use of an INA hierarchy with mainpart abstraction seems to work well
as shown in results. For an object of $n$ parts and models, a naive pairwise relational
match requires $n^4$ “rectangles” as opposed to $n^2$ for the INA, ina scheme used here.
On the other hand, one might expect the use of an ISA hierarchy to promote rapid
and correct matches. The use of ISA links allows partial information about data objects
to be stored. Such information then directs further matching, leading to faster
convergence. The preliminary results show some advantage in using ISA links, but
more work is needed.

In  summary,  Stickville  implements  in  a  neural  network  important  aspects  of  a 
general  recognition  strategy  as  outlined  in  [7].  Further  progress  here  should  be  helpful 
in  realizing  networks  of  some  consequence  for  solving  real  vision  problems. 